

AN EMPIRICAL EDUCATION REPORT APPENDIX 


Final Report of the i3 Impact Study of Making Sense
of SCIENCE 


2016-17 THROUGH 2017-18 
APPENDIX 


November 18, 2020 


Andrew P. Jaciw, Co-principal Investigator 
Thanh Nguyen, Co-principal Investigator 
Li Lin 

Jenna L. Zacamy 

Connie Kwong 

Sze-Shun Lau 


Empirical Education Inc. 




Empirical Education Inc. - 2955 Campus Dr #110, San Mateo, CA 94403 - empiricaleducation.com - @empiricaled 


FIND THE FINAL REPORT AT HTTPS://WWW.EMPIRICALEDUCATION.COM/MSS/


Table of Contents

Appendix A. Making Sense of SCIENCE Logic Model Terminology and Definitions
Appendix B. Survey Scales of Teacher Attitudes and Beliefs, Opportunities to Learn, and School Climate
Appendix C. Teacher Content Knowledge Assessment: Item-Level Information
Appendix D. Assessment of Student Science Achievement: Construction, Forms, Administration, [...]
Appendix E. All Student Survey Scales Measuring Opportunity to Learn and Non-Academic Outcomes
Appendix F. Description of the Principal Surveys
Appendix G. Description of the Pilot of the Video and Audio Recordings
Appendix H. Hierarchical Linear Model for the Analysis of Impact on Teacher Content Knowledge
Appendix I. Detailed Results of the Benchmark Analysis of Impacts on Teacher Content Knowledge
Appendix J. Sensitivity Analysis for the Analysis of Impact on Intermediate Outcomes
Appendix K. Hierarchical Linear Model Associated with the Confirmatory Impacts on Student Science Achievement (Selected-Response Items)
Appendix L. Full Estimates of the Benchmark Impact Model for the Confirmatory Analysis of Impacts on Student Science Achievement (Selected-Response Item)
Appendix M. Sensitivity Analyses for the Confirmatory Impacts on Student Science Achievement (Selected-Response Items)
Appendix N. Full Estimates of the Benchmark Impact Model for the Confirmatory Analysis of Impacts on Science Achievement of Students in the Lowest Third of Incoming Achievement
Appendix O. Sensitivity Analyses for the Confirmatory Impacts on Science Achievement of Students in the Lowest Third of Incoming Achievement
Appendix P. Sample Sizes and Baseline Equivalence for the Impact on Student Science Achievement for Specific Subsamples
Appendix Q. Supplemental Analysis on the Impact on Student Science Achievement under High [...]
Appendix R. Detailed Impact Analysis Findings for the Constructed Response Item
References


© 2020 EMPIRICAL EDUCATION INC 


EFFECTIVENESS OF WESTED'S MAKING SENSE OF SCIENCE 


Appendix A. Making Sense of SCIENCE Logic Model Terminology and Definitions 


This appendix provides the key terms used in the Making Sense of SCIENCE logic model and their
definitions, as supported by extant literature. This appendix was written and provided by the WestEd
program team.


LOGIC MODEL CONSTRUCT DESCRIPTIONS 


Leadership Outcomes 


We posit that the Making Sense of SCIENCE Leadership Development component has a direct impact
on state, regional, district, school, and teacher leaders and has more distal effects on school culture
and teacher attitudes and beliefs.


State and Regional Leader Outcomes 


• Deeper knowledge of standards implementation (e.g., NGSS) — State and regional leaders with
greater knowledge of reform-based standards and best practices associated with standards
implementation are better equipped to build an infrastructure for developing and sustaining
improvements for science education in the long term (Penuel et al., 2014).


• Greater ability to support implementation of school/district professional learning — With technical
assistance, state and regional leaders are able to set priorities and adequately align resources to 
support professional learning, and science teaching and learning. 


Administrator Outcomes 


We posit that Making Sense of SCIENCE has an impact on school principals, coaches, and district
administrators.


• Deeper knowledge of instructional shifts in science and standards implementation supports — The
literature suggests that when administrators have a deeper understanding of reform-based
standards and the instructional shifts required, they have a better understanding of how
demanding this work is and the kinds of supports their administrators and teachers need.
Subsequently, they are able to provide the appropriate supports for standards implementation
(Iveland et al., 2017).


• Shifted beliefs that learning science is as important as other subjects — Teachers often cite time as
the biggest barrier to teaching science, due to the demands to meet accountability
requirements for math, reading, and writing. When administrators come to believe that
science is also an important subject, they are able to signal to teachers that science is a priority
and allocate more time and resources to support science teaching and learning.


• Increased philosophical alignment with standards — When administrators understand the
instructional shifts required by reform-based standards, they have a better understanding of 
what that will look like in the classroom and will give teachers permission to grow and fail as 
they try to incorporate these instructional shifts in their classroom. 


Teacher Leaders 


Making Sense of SCIENCE grows the leadership capacities of teachers through professional learning 
and coaching that strengthens the knowledge and skills needed to be effective in their own 
classrooms. We intentionally build the skills and confidence of teacher leaders to facilitate the 
professional growth of their peers. We posit that MSS has an impact on the skills and knowledge of 
teacher leaders. 


• Deeper knowledge of standards implementation (e.g., NGSS) — When teacher leaders have greater
knowledge of standards, they are able to take on the role of a curriculum specialist and can
serve as a catalyst of change to bring about the implementation of science standards in a
school (Harrison & Killion, 2007).


• Greater skill in facilitating teacher learning and collaboration — When teacher leaders develop their
content and pedagogical content knowledge through Making Sense of SCIENCE courses, and 
also develop their facilitation skills through the Making Sense of SCIENCE Facilitation 
Academies, they are able to facilitate communities of learning through school-wide approved 
processes, particularly professional learning communities (PLCs). In PLCs, when teachers 
learn with and from one another, they can focus on what most directly improves student 
learning (Harrison & Killion, 2007). 


Teacher Outcomes 


We posit that the MSS Teacher Professional Learning component has a direct impact on teachers’ 
content knowledge and pedagogical content knowledge, and may have a distal impact on teacher 
attitudes and beliefs. 


Content Knowledge 


Teacher content knowledge is used to describe the body of knowledge that teachers teach and that 
students are expected to learn in a content area. The focus on teacher content knowledge is aligned 
with the literature that provides clear evidence on the critical role that teacher content knowledge 
plays in raising student achievement (Hill et al., 2005; Kanter & Konstantopoulos, 2010). 


Pedagogical Content Knowledge 


Pedagogical Content Knowledge (PCK) is used to describe the knowledge that teachers use to 
transform particular subject matter for student learning. We are guided by the definition of PCK as 
identified in the Revised Consensus Model (RCM) of PCK (Carlson & Daehler, 2019). This model 
identifies three distinct realms of knowledge that teachers have that ultimately mediate student 
outcomes: 1) collective PCK, which is described as the specialized professional knowledge held by 
educators in the field; 2) personal PCK, which is the cumulative and procedural pedagogical content 
knowledge and skills of an individual teacher; and 3) enacted PCK, which refers to a teacher’s practice 
of engaging with teaching during planning, instruction, and reflection on instruction and student 
outcomes. 


Attitudes and Beliefs 


Attitudes and beliefs amplify how teachers develop their personal PCK. The literature on
attitudes and beliefs documents the connection between 1) teacher attitudes and beliefs, and 2)
teachers’ thought processes, classroom practices, change, and pedagogical practices used to teach
(Porter and Freeman, 1985, as cited in Pajares, 1992). Additionally, attitudes and beliefs shape the
way teachers react and choose to respond to reforms (Jones & Carter, 2013). We hypothesize that
with the Making Sense of SCIENCE professional learning courses and PLCs, we can expect to see
some shifts in teachers’ implicit knowledge and beliefs about students and teaching, as identified by
the constructs below. The first three constructs below are explicitly supported by Making Sense of
SCIENCE professional learning, and the remaining constructs represent more distal expected teacher
outcomes.


• Belief that students are capable learners — Literature suggests that teachers’ expectations of
student abilities, and of how changeable those abilities and students’ potential are, interact with
teachers’ behavior in the classroom. The National Science Education Standards (National Research
Council, 1996) are based on five key assumptions, one of which is “Actions of teachers are
deeply influenced by their understanding of and relationships with students.”


• Philosophically aligned with standards — For teachers to be able to make the shifts required by
three-dimensional science standards, they need to develop themselves, and understand that 
students need deeper understanding of science and engineering content through making sense 
of phenomena and designing solutions. Students also need opportunities to integrate science 
content and practices. In order for teachers to guide students to making sense of phenomena, 
teachers need to see a) their role shift to being a facilitator in learning, and b) students taking 
on the process of learning like scientists and engineers through active exploration and
sense-making processes.


• Values being a reflective practitioner — Science teachers need to engage in reflective practices to
assess their teaching of science as promoted by the National Science Education Standards:
Teachers of science engage in ongoing assessment of their teaching and of student learning. In
doing this, teachers use student data, observations of teaching, and interactions with 
colleagues to reflect on and improve teaching practice (National Research Council, 1996). 
Making Sense of SCIENCE supports teacher reflective cycles of their practice by examining 
student data and interacting with other teachers through PLCs. 


• Confidence — As elementary science teachers often express a severe lack of confidence in science
teaching (Murphy et al., 2007), science professional learning is hypothesized to impact teacher
confidence to:

o teach science;
o teach with science instructional practices; and
o support literacy.


• Self-efficacy — When teachers become knowledgeable about a particular subject, it increases
their belief in their capability to organize and execute courses of action required to
successfully accomplish a specific teaching task in a particular context (Tschannen-Moran et 
al., 1998). Teachers who have higher self-efficacy, content knowledge, and attitudes have 
students with higher achievement than do teachers who have lower levels of self-efficacy 
(Evans, 2011). 


• Agency in the classroom — Teachers who are given autonomy to teach science by supportive
districts and administrators have the capacity to act intentionally in setting instructional goals 
in their classrooms (Calvert, 2016). 


• Agency in science leadership — The capacity of teachers to act purposefully to direct their
professional growth and contribute to the growth of their colleagues depends on a teacher’s
internal traits and on structural conditions that support professional learning (Calvert,
2016). 


• Professional aspirations — The extent to which teachers stay in a school or school system as
teachers and their aspirations to pursue leadership positions are influenced by school cultures
that value and respect teachers and develop teaching and leadership expertise (Cameron &
Lovett, 2015).


School Climate 


We posit that when Making Sense of SCIENCE works with state and regional coordinators, districts, 
and school principals through our partnership and leadership development offerings, we can see 
positive changes that trickle down to create a positive district and school climate that is conducive to
science teaching and learning. 


District Support 


An essential element of reform in science education is district support for science teaching and 
learning. We posit that Making Sense of SCIENCE contributes to the following improvements at the 
district level. 


• Providing guidelines on science instruction — District guidelines outline the shifts expected
in elementary science, including curriculum frameworks, evaluation criteria for instructional
materials in science, and the scope and sequence for science teaching.


• Allocating resources for professional learning in science — District guidelines that allocate coaching
resources, professional learning time, teacher pay, or substitute time for science professional 
learning. 


• Allocating resources for science materials — District leaders make investments and allocate
resources to purchase science curriculum, instructional materials, laboratory equipment, and 
technology supports for science teaching and learning. 


• Prioritizing support for science learning — Superintendents and other district leaders signal the
importance of science education by outlining guidelines for time on science and putting 
science on the agenda for administrators to take to their school sites. 


• Participating in science-related conversations/activities — Superintendents and other district
leaders actively attend meetings to get informed on standards-based science implementation 
and take part in discussions that shape science education. 


• Involving teachers in district science decisions — Superintendents and district leaders actively
invite teachers to create or provide input on science standards-based implementation in Local 
Control and Accountability Plans and involve them in the selection of instructional materials. 


• Building capacity for science professional learning — Superintendents and other district leaders
invest in building leadership capacity and material support for teacher professional learning in 
science. 


Administrative Support 


We posit that Making Sense of SCIENCE contributes to the following improvements at the administrator
level.


• Providing science resources and supplies — Administrators approve teacher requests and increase
the availability of science resources and supplies in a school (Iveland et al., 2017).


• Supporting teacher collaboration — Administrators forge the conditions that make PLCs a
priority by changing the structure of the school day, and providing the financial support 
needed to make PLCs happen (Iveland et al., 2017). 


• Acting as an instructional leader — School principals can play an important role as instructional
leaders when they spend time supporting and collaborating with teachers on science
content and instruction. When administrators participate in professional learning alongside
teachers, they are more likely to support and compel teachers to improve their practice and to
learn new skills (Jenkins, 2009; Casey et al., 2012).


• Prioritizing support for science learning and teacher professional learning — When school principals
and administrators participate in professional learning alongside teachers, they are more likely 
to allocate more time for science instruction, extracurricular science activities, and teacher 
collaboration (Iveland et al., 2017) and allocate time and resources for teacher professional 
learning in science. 


• Involving teachers in science leadership — Principal actions and the relationships among adults
in a school are determining factors in developing and sustaining science leadership, particularly
among teachers. When principals empower teachers to take on additional science leadership 
responsibilities, teachers are able to lead within and beyond the classroom, identify with and 
contribute to a community of teacher learners, and influence others towards improved 
instructional practice (Katzenmeyer & Moller, 2009). 


Collaboration 


Sustained, job-embedded, and collaborative teacher learning can occur in PLCs. In PLCs, teachers 
collaborate and work together in continual dialogues to examine their practice and student 
performance and to develop and implement more effective instructional practices
(Darling-Hammond & Richardson, 2009). We posit that when teachers and administrators participate in
Making Sense of SCIENCE, it contributes to improvement in the following areas. 


• Teacher-to-teacher collaboration — When teachers collaborate in functional PLCs, they are able to
take risks in teaching and make changes in instruction that are reform-oriented and
student-centered (Briscoe & Peters, 1997; Brahier & Schaffner, 2004). The expected changes
associated with PLCs result from increases in the amount of time allocated for collaboration in
PLCs and the type of substantive activities around content learning and instruction that teachers
engage in during PLCs (Graham, 2007).


• Administrator-to-teacher collaboration — Similar to collaboration between peers, when
administrators gradually take on the role of instructional leaders and increase the amount and
type of substantive activities in which they collaborate with teachers in PLCs, teachers are
encouraged to take risks and change their instruction towards reform-oriented practices
(Urick et al., 2018).


School Culture 

A positive school culture promotes cooperative learning, group cohesion, respect, and mutual trust,
which can directly improve a school’s learning environment (Thapa et al., 2013). With the two-pronged
approach of Making Sense of SCIENCE in providing teacher and leadership professional
learning, we expect to see distal improvements at the school level.


• Learning climate — Schools experience changes towards a conducive learning climate that is
student-centered and endorses ambitious academic work coupled with adequate support for 
all students (Bryk, 2010). 


• Trust and respect among peers and among peers and administrators — In schools where teachers
increasingly trust and respect one another and their administrators, the environment becomes
conducive to learning for both teachers and students. Principals supportive of science as a priority
play a critical role in influencing the levels of trust and respect between teachers (Hallam et al., 2015).


Opportunity to Learn Science in the Classroom 

Opportunity to learn is a multi-dimensional construct central to quality teaching and a prerequisite to
student achievement (Elliott & Bartlett, 2016). It comprises the amount of instructional time on
science, the content taught, and the quality of science instruction that reflects the shifts in
three-dimensional science standards. Conducive classroom cultures also facilitate student-centered
learning of science.


We posit that students with Making Sense of SCIENCE teachers are likely to see the following 
changes in their opportunities to learn science in the classroom. 


Time on Science 


Measures of instructional time in the literature have been grouped into four ranges: years, days, hours,
and minutes (Frederick & Walberg, 1980). We define time on science by the time allocated to science
in minutes and the frequency with which it is taught during the week.


• Amount of time on science, in minutes of science learning per week, and integrated
science-literacy time

• Frequency with which science is taught per week and per year


The types of tasks and activities teachers use to engage students in science look different in an
NGSS-aligned classroom (Tekkumru-Kisa et al., 2020). Consequently, we expect to see shifts
in the types of instructional tasks teachers assign and the content students engage with, in line
with three-dimensional learning, as listed under the Instructional Changes and Content of Science
Taught sections below.


Instructional Changes 


• Sense-making of hands-on investigations — Sense-making is the process that students and teachers
undertake to think together and to make sense of what things mean. When students conduct
investigations and produce data, they work together to analyze these data
by looking for patterns and relationships to develop explanations and models (McNeill et al.,
2015).


• Engaging in scientific argumentation — Our definition is aligned with the National Research
Council (2013), which outlines that when students engage in scientific argumentation, they are 
expected to listen to, compare, and evaluate competing ideas and methods based on their 
merits. 


• Explaining ideas and phenomena — Phenomena are events that are observable and repeatable
and can be explained or predicted using science knowledge. The instructional shifts required 
by three-dimensional standards use phenomena as the starting point for learning. Students are 
taught to develop ideas, based on evidence, to explain phenomena (Achieve, 2017). 


• Integration of science and literacy — Literacy is the ability to read, write, and engage with
scientific texts. When students engage in these activities, they are able to deepen their
conceptual understanding of science (Cervetti et al., 2012).


• Integration of science and mathematics — Our definition is aligned with the National Research
Council (2013), which outlines that the integration of mathematics is fundamental in providing 
students with opportunities to engage in a range of tasks such as constructing simulations; 
statistically analyzing data; and recognizing, expressing, and applying quantitative 
relationships. 


• Participating in collaborative discourse — When teachers create classroom discourse structures,
they enable both the students and the teacher to engage in collaborative knowledge building.
These collaborative processes also use discourse structures that move away from the usual
mode of classroom discourse, in which teachers follow the pattern of initiating, responding
to, and evaluating (IRE) responses (Hmelo-Silver & Barrows, 2008).


• Reflecting on learning — Metacognitive inquiries and formative assessment practices are
powerful learning tools. Metacognition is defined in terms of student understanding of their 
processes of learning, in terms of how and what they learn. When students engage in 
metacognitive discourse, they engage in the process of making explicit their tacit reasoning 
and problem-solving strategies (Greenleaf et al., 2011). Formative assessments also help 
students understand their learning, but it is important for both teachers and students to reflect 
on what they learn about student understanding. 


• Participating in cognitively challenging tasks — Cognitively challenging tasks refer to the depth of
student engagement in conceptual thinking. We are guided by the definition of Elliott &
Bartlett (2016), which prescribes that teachers must dedicate instructional time to addressing a
range of cognitive processes, instructional practices, and grouping formats when covering 
content. 


Content of Science Taught 


• Standards-aligned science concepts in earth, life, and physical science disciplinary core ideas
are taught with breadth and depth.


• Science practices of developing and using models, arguing from evidence, constructing
explanations, and analyzing data and representations of data are taught with breadth and 
depth. 


• Cross-cutting concepts of cause and effect, energy and matter, and systems and systems
models are taught with breadth and depth — The identified cross-cutting concepts provide the
connections and tools to understand the science concepts taught in Making Sense of SCIENCE
in earth, life, and physical sciences.


• Literacy skills related to science-specific ways of reading, writing, and engaging in scientific
discourse are taught in breadth and depth — According to the National Science Education 
Standards, scientific literacy means that a person can ask questions, and is able to read about, 
describe, explain, and write about natural phenomena. For students to acquire science-specific 
literacy skills, they need to learn and observe how to read, write, and engage in discourse 
using science-specific conventions that model how scientists work every day (NRC, 2013; 
Wright et al., 2016).


Conducive Classroom Cultures 

Classroom culture is influenced by teacher attitudes and approaches, and teacher participation in 
professional learning is linked to investigative classroom cultures (Supovitz & Turner, 2000). We posit
that students with Making Sense of SCIENCE teachers will show improvements in the following 
characteristics of conducive classroom cultures. 


• Student-centered learning — Student-centered classrooms are characterized by teachers who
know and communicate that they do not need to know everything and who value student 
ideas in making sense of phenomena. Student-centered classrooms are also characterized by 
sustained engagement with student questions and ideas. These classrooms are characterized 
by a safe classroom culture, in which students and teachers celebrate risk taking in learning. 


• Student agency — We define agency as students’ choice and capacity to take responsibility for
their own learning. Classroom cultures also promote student agency when students feel that 
they can share ideas without being held up for ridicule and recognize the dialogical 
opportunities available when they consider and value each other’s ideas in the process of 
learning (Cavagnetto et al., 2020). 


• High expectations of students — When teachers raise their expectations and increase the rigor of
their instruction, they facilitate student learning. Teacher expectations also contribute to the
whole-class teaching environment through grouping choices, a continuum of cognitively 
challenging tasks, and student agency in what they learn (Rubie, 2009). 


• Environment conducive to learning with appropriate classroom management — Classroom
management plays an important role in creating a safe and conducive learning environment 
for learning science. Making Sense of SCIENCE staff model how inquiry-based learning can be 
facilitated in the classroom during teacher professional learning. 


• Active student engagement — When students are actively engaged in a classroom, they are seen
participating in discussion and showing understanding of the purpose of the lesson goals. 


Student Achievement and Attitudes 

We posit that Making Sense of SCIENCE has a distal impact on student achievement, student 
attitudes, and dispositions towards science. Specifically, we hypothesize seeing improvements in the 
following student outcomes. 


Science and English Language Arts Achievement 


• Science knowledge — improved student science content knowledge in earth and physical
sciences 


• Science practices — practices used by scientists as they investigate models and build theories
about the world (National Research Council, 2012). We are particularly interested in looking at 
how Making Sense of SCIENCE improves student skill with developing and using models; 
arguing from evidence; constructing explanations; and analyzing data and representations of 
data. 


• Communicating science ideas — reading and writing are essential skills in science (National
Research Council, 2012). Making Sense of SCIENCE improves student skills in communicating 
science ideas through writing and sustained productive scientific discourse. 


• English Language Arts Achievement — improved student achievement in reading, writing,
speaking and listening 


Student Attitudes 


• Aspirations — improved student dispositions and attitudes towards science and the
development of an interest in pursuing a career in science or science-related work (Tytler &
Osborne, 2012).


• Self-efficacy — students who judge themselves to be efficacious in science and their academic
capabilities in science also foster a sense of efficacy to pursue careers in science (Bandura et al., 
2001). 


• Agency in learning — students’ choice and capacity to take responsibility for their own
learning. Classroom cultures also promote student agency when students feel that they can 
share ideas without being held up for ridicule and recognize the dialogical opportunities 
available when they consider and value each other’s ideas in the process of learning 
(Cavagnetto et al., 2020). 


• Enjoyment of science — greater student enjoyment of science is found to be predictive of
students’ interest in engaging further with science topics (Ainley & Ainley, 2011). 


Appendix B. Survey Scales of Teacher Attitudes and Beliefs, Opportunities to Learn, and School Climate 

This appendix provides information on the 30 intermediate outcomes analyzed and reported in Chapter 5. For each outcome, we list
the survey items that were used to construct the outcome, the scale of the items, the data source, the number of items, the resulting
Cronbach's alpha, and the method used to create the outcome.
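As an illustration only, the two quantities reported for each multi-item outcome (Cronbach's alpha and an "average over items" composite) can be sketched as follows. The response data and the function names `cronbach_alpha` and `composite_scores` are hypothetical and are not taken from the study's analysis code; the sketch uses the standard alpha formula with sample variances. Alpha is undefined for a single item, which is consistent with the "NA" entries for single-item outcomes.

```python
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows, each holding k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha is undefined for a single item")
    items = list(zip(*responses))  # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def composite_scores(responses):
    """Composite for each respondent: the average over that respondent's items."""
    return [mean(row) for row in responses]


# Hypothetical responses from four teachers to a three-item, 5-point agreement scale.
example = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
]
alpha = cronbach_alpha(example)
scores = composite_scores(example)
print(round(alpha, 3), scores)
```

For outcomes measured on several surveys, the same composite would be computed per survey and then averaged across surveys, as described in the method column.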


TABLE B1. COMPOSITE CREATION FOR INTERMEDIATE OUTCOMES 


Scale of items Data No. of Cronbach's Method of
Outcome on which to assess impact and items included (post recoding) source items alpha composite creation


• The majority of my students are capable of learning rigorous science even if they come from a challenging environment.
• The majority of my students are capable of going to science-related careers.
• Given the right supports, my low performing students typically are able to learn challenging science concepts.
Scale: 5-pt; agree | Data source: Tch: W 18 | No. of items: 3 | Cronbach's alpha: 0.853 | Method: Average over items


• To what extent do you think that NGSS is aligned with your pedagogical practices?
Scale: 5-pt; align | Data source: Tch: Spr 18 | No. of items: 1 | Cronbach's alpha: NA | Method: Average over items


• I am confident I can learn science given the right support.
• I frequently seek out information or learning opportunities to strengthen my teaching.
• I actively seek input from colleagues to improve my teaching.
• It's okay if I don't feel confident in science. I can build off of my current understanding.
Scale: 5-pt; agree | Data source: Tch: W 18 | No. of items: 4 | Cronbach's alpha: 0.612 | Method: Average over items


• Analyzing and interpreting data from maps to describe patterns of Earth's features
• Using evidence to construct an explanation relating the speed of an object to the energy of that object
• Developing a model of waves to describe patterns in terms of amplitude and wavelength and that waves can cause objects to move
• Developing a model using an example to describe ways the geosphere, biosphere, hydrosphere, and/or atmosphere interact
• Developing a model to describe that matter is made of particles too small to be seen
• Supporting an argument that the gravitational force exerted by Earth on objects is directed down
Scale: 5-pt; cont | Data source: Tch: Spr 18 | No. of items: 6 | Cronbach's alpha: 0.887 | Method: Average over items


Confidence in science instructional practices

• Teach science in engaging ways
• Teach students to do hands-on science activities or investigations
• Get students to use scientific terms accurately
• Teach students to collect data
• Teach students to represent data (e.g., graphs, images, simulations, physical models)
• Teach students to identify evidence or data that support a claim
• Use a variety of models (e.g., graphs, images, simulations, physical models) to support students’ science learning
• Have students use existing models (e.g., graphs, images, simulations, physical models) to explain something that has been observed
• Help students develop their own models to explain a phenomenon
• Help both your high and low achieving students learn challenging science
• Get students to reflect on their learning and then to revise their thinking
• Help students understand the world in terms of interacting systems
• Foster discussions among students that help them learn science
• Explicitly teach students how to have productive science conversations that are grounded in evidence
• Teach science in a way that meets the NGSS expectations
Scale: [illegible in source] | Data source: Tch: Spr 18 | No. of items: 16 | Cronbach's alpha: 0.969 | Method: Average over items


• Assign writing tasks that help students learn science
• Help students understand how to use reading strategies to make sense of science texts
• Help students communicate science ideas in writing
• Explicitly teach students how to read complex informational texts that include graphs, diagrams, symbols, and data tables
• Explicitly teach students how to write scientific explanations
• Teach students to articulate clear, convincing reasons for their answers
Scale: 5-pt; cont | Data source: Tch: Spr 18 | No. of items: 6 | Cronbach's alpha: 0.947 | Method: Average over items


• I understand science concepts well enough to be effective in teaching elementary students.
• I am typically able to answer students' questions related to the science they are studying.
• I am typically able to respond effectively to students’ ideas about most science topics.
• I am effective at explaining to students scientific reasons for outcomes of science experiments.
• I am skilled at identifying what science concepts my students find confusing.
Scale: 5-pt; agree | Data source: Tch: Spr 18 | No. of items: 5 | Cronbach's alpha: 0.925 | Method: Average over items


• Setting performance standards for students
• Selecting science curriculum
• Determining the pedagogical techniques that you use to teach science
• Choosing criteria for grading student performance
• Amount of time science is taught
• Pacing of science instruction
Scale: 5-pt; intl | Data source: Tch: W 18 | No. of items: 6 | Cronbach's alpha: 0.840 | Method: Average over items




• Think about the times you spent teaching science during these past four school weeks. Approximately, how many total hours of science did you teach per week in those weeks?
Scale: Number of hours | Data source: Tch: F/W/Spr 18 | No. of items: 1 | Cronbach's alpha: NA | Method: For each survey, summed across 4 weeks; average over the three surveys


• Working collaboratively in small groups
• Engaging in hands-on science activity
• Making a claim based on a hands-on activity or data
Scale: 5-pt; emph | Data source: Tch: F/W/Spr 18 | No. of items: 3 | Cronbach's alpha: 0.852 (F), 0.767 (W), 0.852 (Spr) | Method: Average over items


• Constructing a written scientific explanation for a "how" or "why" question
• Constructing a verbal scientific explanation for a "how" or "why" question
• Listening to and building on other peoples’ ideas
• Writing to support learning from a hands-on activity
• Reading to support learning from a hands-on activity
• Discussing to support learning from a hands-on activity
Scale: 5-pt; emph | Data source: Tch: F/W/Spr 18 | No. of items: 6 | Cronbach's alpha: 0.890 (F), 0.882 (W), 0.891 (Spr) | Method: Average over items


• Listening to and building on other peoples’ ideas
• Discussing to support learning from a hands-on activity
Scale: 5-pt; emph | Data source: Tch: F/W/Spr 18 | No. of items: 2 | Cronbach's alpha: 0.738 (F), 0.710 (W), 0.724 (Spr) | Method: Average over items




• Exploring real-world phenomena
• Making sense of science ideas
Scale: 5-pt; emph | Data source: Tch: F/W/Spr 18 | No. of items: 2 | Cronbach's alpha: 0.694 (F), 0.700 (W), 0.861 (Spr) | Method: Average over items


• Evidence of change in landscape over time
• Relationship between fossils and rock layers
• Explaining the brightness of the Sun relative to other stars
• Explaining day and night
• Explaining changing positions of the Sun, moon, and stars
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 5 | Cronbach's alpha: 0.788 | Method: Average over items


• Effects of weathering
• Factors affecting rates of erosion
• Defining Earth's systems (e.g., atmosphere, biosphere, geosphere, hydrosphere)
• How living things affect their physical environments
• Interactions affecting Earth's systems (e.g., landforms, climate, weather)
• Interpreting maps of Earth's features (e.g., mountains, ocean trenches, volcanoes)
• Distribution of water on Earth (e.g., oceans, glaciers, atmosphere)
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 7 | Cronbach's alpha: 0.813 | Method: Average over items


• Renewable and nonrenewable resources
• Types of natural hazards
• Human impact on Earth systems
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 3 | Cronbach's alpha: 0.866 | Method: Average over items




• How unbalanced forces affect motion
• Using patterns in motion to predict future motion
• Contact forces between objects
• Factors that affect the size of electric and magnetic forces
• Direction of gravitational force
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 5 | Cronbach's alpha: 0.855 | Method: Average over items


• Energy associated with objects moving at different speeds
• Energy associated with sound, light, and electrical currents
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 2 | Cronbach's alpha: 0.744 | Method: Average over items


• Transfer of light energy
• Transfer of electrical energy
• Transfer of energy when objects collide
• Change in motion when objects collide
• Transfer of energy from the Sun to plants to animals
• Conversion of stored energy to other types
• Conservation of energy
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 7 | Cronbach's alpha: 0.874 | Method: Average over items


• Defining waves
• Amplitude and length
• Light and observing objects
• Transmitting digitized information
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 4 | Cronbach's alpha: 0.917 | Method: Average over items




• Particulate nature of matter
• Identifying substances based on their properties
• Identifying chemical reactions
• Conservation of matter
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 4 | Cronbach's alpha: 0.911 | Method: Average over items


• Asking and defining problems
• Developing and using models
• Planning and carrying out investigations
• Analyzing and interpreting data
• Using mathematics and computational thinking
• Constructing explanations and designing solutions
• Engaging in argument from evidence
• Obtaining, evaluating, and communicating information
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 8 | Cronbach's alpha: 0.896 | Method: Average over items


• Patterns
• Cause and effect: mechanism and explanations
• Scale, proportion, and quantity
• Systems and system models
• Energy and matter: flows, cycles, and conservation
• Structure and function
• Stability and change
Scale: 3-pt; Did teach | Data source: Tch: Spr 18 | No. of items: 7 | Cronbach's alpha: 0.890 | Method: Average over items




• Approximately how much total time, if any, in the past four school weeks did you participate in informal peer collaboration for science instruction (for example, sharing lesson plans or resources, discussing student work, informally observing a colleague's science lesson, etc.)?
Scale: Select 1 of 4 or 5 options | Data source: Tch: F, W, Spr 18 | No. of items: 1 | Cronbach's alpha: NA | Method: Average across 3 surveys


• Collaboration happens organically among teachers.
• Teachers find peer collaboration helpful.
Scale: 5-pt; agree | Data source: Tch: Spr 18 | No. of items: 2 | Cronbach's alpha: 0.707 | Method: Average across items


• Teachers trust each other in this school.
• Teachers regularly observe each other teaching classes.
• It's okay to discuss feelings, worries, and frustrations with other teachers.
• Teachers are supported by their colleagues to try out new ideas in teaching.
• Teachers respect other teachers who take the lead in school improvement efforts.
• Coaches and/or mentors are respected by teachers at this school.
Scale: 5-pt; agree | Data source: Tch: W 18 | No. of items: 6 | Cronbach's alpha: 0.815 | Method: Average across items


• There is an atmosphere of trust and mutual respect between teachers and school administrators.
• Teachers feel comfortable raising issues and concerns with school administrators.
• The school administration consistently supports teachers.
Scale: 5-pt; agree | Data source: Tch: W 18 | No. of items: 3 | Cronbach's alpha: 0.906 | Method: Average across items


• Peer collaboration is supported by administrators at my school.
Scale: 5-pt; agree | Data source: Tch: Spr 18 | No. of items: 1 | Cronbach's alpha: NA | Method: None needed




• Our principal/assistant principal provides support for professional learning.
• Our principal/assistant principal provides the support teachers need to improve our science instruction.
Scale: 5-pt; agree | Data source: Tch: W 18 | No. of items: 2 | Cronbach's alpha: 0.752 | Method: Average across items


• Teachers are relied upon to make decisions about educational issues.
• Teachers are encouraged to participate in school leadership roles (e.g., leader of a professional learning community (PLC), mentor, member of the School Improvement Team).
Scale: 5-pt; agree | Data source: Tch: W 18 | No. of items: 2 | Cronbach's alpha: 0.675 | Method: Average across items


Note. Under the Data Source column: Tch = Teacher, F = fall, W = winter, Spr = spring; 18 indicates that the survey was administered in Year 2 (2017-18).


Appendix C. Teacher Content Knowledge Assessment: Item-Level Information 


This appendix presents additional information about the teacher content knowledge assessment, including the source, brief description, 
proportion correct, biserial correlation, and item difficulty and discrimination for a 2-Parameter Logistic (2PL) model. 
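To make the two item parameters concrete, the following is a minimal sketch of the standard 2PL item response function, evaluated with the difficulty and discrimination reported in Table C1 for the MOSART Physics gravity item. The function name `prob_correct_2pl` is hypothetical, and the sketch assumes the parameters are on the logistic metric (no 1.7 scaling constant) and that the ability scale is centered at 0; the source does not state either convention.

```python
import math


def prob_correct_2pl(theta, difficulty, discrimination):
    """Probability of a correct response under a 2PL model (logistic metric assumed)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Difficulty and discrimination from the Table C1 row for the MOSART Physics gravity item.
difficulty, discrimination = -0.880, 0.816

# A respondent whose ability equals the item difficulty has a 50% chance of answering correctly.
p_at_difficulty = prob_correct_2pl(difficulty, difficulty, discrimination)

# A respondent at theta = 0 (assumed here to be the center of the ability scale).
p_at_zero = prob_correct_2pl(0.0, difficulty, discrimination)
print(p_at_difficulty, round(p_at_zero, 3))
```

Under this parameterization, a more negative difficulty means an easier item, and a larger discrimination means the probability of a correct response rises more steeply around the difficulty value.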


TABLE C1. ITEM-LEVEL INFORMATION FOR THE TEACHER CONTENT KNOWLEDGE ASSESSMENT 


Proportion Biserial 2PL 2PL 


Item ID correct correlation difficulty discrimination Source Description 


a7 sod -1.800 0.871 _ ee Sune Water cycle/Energy (diagram) 
949 393 -1.946 ZT ue i ans Water cycle/Process (diagram) 
653 wo -0.880 0.816 MOSART Physics Gravity on objects 

703 163 -2.437 0.365 MCAS compare two waves 

881 151 -3.311 0.653 NY Regents ES Jan 2017 Reflecting insolation 


788 A405 -1.399 1.178 MOSART Astronomy Earth’s rotation 
8/3 262 -2.107 1.109 MCAS ID which part is wavelength 
120 246 -1.289 0.840 MOSART Earth Science Evaporation 
551 425 -0.190 1.415 seal yell Density 
cience 
881 ZS -3.821 0.595 MOSART Earth Science Mountains/tectonics 
254 so27 1,165 1.164 MOSART Physics Transfer of energy/open system 
034 166 -0.362 0.387 MOSART Astronomy Sun & ice/temp 
o25 .270 -0.150 0.742 ad oe euite Heat transfer through conduction (diagram) 
305 .207 1.770 0.491 Ne ie une Transfer of energy in a system (block on table) 
949 168 -4.787 0.651 ane oo ae Intensity of insolation (diagram) 
881 oo -2.056 1.218 NAEP Ice melts 
831 A411 =1.957 1.334 MCAS Phase change/physical change 
534 524 0.097 2.150 ty aaa Molecules 
576 250 0.552 0.600 aa Sun movement 
881 408 -1.725 1.652 NAEP Sun warms water 
E26 348 0.792 0.796 me ere mune Light waves 
A466 162 0.309 0.466 saa eee lee Chemical change 
cience 
A15 224 0.704 0.518 Ne atin Diagram path of ball thrown through the air. 




-0.607 0.938 NY Regents ES Jan 2017 Heat transfer (diagram) 




839 328 -1.936 1.013 NECAP temp of spoons in water--heat transfer 
432 347 0.376 0.850 MOSART Chemistry Molecular structure, phys change 
eee 407 -1.021 1.246 MOSART Earth Science Earth closed system 

924 236 -3.007 0.950 NECAP hammer nail, energy transfer 
619 .300 -0.819 0.644 ne Se UN acceleration due to gravity (graph) 


Note. 2PL = 2-Parameter Logistic Item Response Theory score calibration; ES = Earth science; PS = physical science; NECAP = The New England Common Assessment
Program; NAEP = National Assessment of Educational Progress; NY = New York; MOSART = Misconception-Oriented Standards-Based Assessment Resources for Teachers;
MCAS = Massachusetts Comprehensive Assessment System




Appendix D. Assessment of Student Science Achievement: Construction, Forms, 
Administration, Item Statistics, and Approaches to Scaling 


This appendix provides the details on the student science achievement assessment.¹ We describe the
assessment’s construction, test forms, administration, and approaches to scaling. We also provide 
select item-level statistics based on Classical Test Theory (CTT) and Item Response Theory (IRT). In 
describing the test forms, we provide brief descriptions of other types of items that were included in 
the science assessment, such as the constructed-response items and student survey scales, in order to 
fully present what was asked of students in spring Year 2 (2017-18). 


CONSTRUCTION OF THE STUDENT SCIENCE ASSESSMENT 


This evaluation was conducted in the 2016-17 and 2017-18 school years, just three years after the 
Next Generation Science Standards (NGSS) were rolled out. Therefore, we faced the immense 
challenge of finding an established NGSS-aligned assessment to evaluate impacts of Making Sense of 
SCIENCE on student science achievement in Grades 4 and 5. 


In the summer of 2014-2015 and throughout the 2015-16 school year, we conducted a search for 
NGSS-aligned instruments to measure student science achievement. We short-listed potential 
instruments and reached out to several assessment developers, including 1) Educational Testing
Service (ETS) about the Cognitively Based Assessment of, for, and as Learning (CBAL®), 2)
Northwest Evaluation Association (NWEA) for their Measures of Academic Progress (MAP)
assessment, and 3) the California Department of Education about their new NGSS-aligned science
assessment. We found that these instruments were not far enough along in the development process.
In fall 2015-2016, at the suggestion of the Making Sense of SCIENCE Technical Working Group (TWG),
we opened discussions with a university-based center that at the time was partnering with the state 
department of education to administer an NGSS-aligned science assessment. When we approached 
the center, the science assessment had been field tested the prior school year (2015-16) and was 
operational in 2016-17. The study team made a few adjustments to the assessment to be suitable for 
administration in this study, such as supplementing grade-appropriate items for students in the 
study and shortening the test in order to administer it within one hour. Prior to the administration of 
the assessment, but when it was too late to change course, evaluators and program developers were
provided the full assessment (as opposed to just the sample of items that we were shown
previously) for review. The team then recognized that the assessment was inadequate due to
inaccuracies in the science content and the verbosity of the questions, which would have been
especially problematic for English learner students in the study. Despite this recognition, the lack of
options at the time compelled the team to administer the test to students in the spring of Year 1
(2016-17). In the fall of Year 2 (2017-18), we faced a difficult decision: either to go with an assessment
in its second year of operation that was problematic in the ways described above, or to proceed with
developing an assessment with no opportunity to pilot. Without yet knowing the results from the
exploratory year, evaluators and program developers jointly decided not to continue using the
assessment and instead to develop an assessment with the guidance of TWG members.


¹ Note that the “student science achievement assessment” refers to the selected-response items of the general science
assessment administered to students, which included both selected-response and constructed-response items.

² This appendix focuses on the selected-response items of the student science assessment. For item-level statistics of the
constructed-response items, contact the developers (Heller Research Associates).


The study team constructed the student science assessment using selected-response items from 
publicly available sources and constructed-response items developed by HRA. The selected-response 
items originated from the following publicly-available tests and item banks: Massachusetts 
Comprehensive Assessment System (MCAS), The New England Common Assessment Program 
(NECAP), the Misconceptions-Oriented Standards-Based Assessment Resources for Teachers 
(MOSART), American Association for the Advancement of Science (AAAS), and National 
Assessment of Educational Progress (NAEP). Two external reviewers with content and test- 
development expertise reviewed items for accuracy, clarity, and alignment with NGSS. To preserve 
the original items as much as possible, we asked the external reviewers to select, reject, or suggest 
only minor revisions to each item. We only made minor revisions to selected items. 


TEST FORMS OF THE STUDENT SCIENCE ASSESSMENT 


We ultimately assembled a pool of 49 selected-response items aligned with fourth- and fifth-grade
standards in Earth and space science (10 for fourth grade, 9 for fifth grade), physical science (10 for
fourth grade, 10 for fifth grade), and scientific inquiry (10 items common to fourth and fifth grade).


The basic organization of test forms is displayed in Table D1.


Forms A-type (A1 to A4) were administered to fourth graders and contain the same 30
selected-response items across the four forms. Forms A1 to A4 differ in the survey
questions they ask. Forms A-type include 10 selected-response inquiry items that are common across
all 16 forms of the assessment.


Forms B-type (B1 to B4) were administered to fifth graders and contain the same 29 selected-response
items across the four forms. Forms B1 to B4 differ in the survey questions they
ask. Forms B-type include 10 common selected-response inquiry items (the same as those in the
fourth-grade A-type forms).


A random sample of about 20% of fifth-grade students received the remaining eight forms (Forms C
to J). These forms contained specific combinations of constructed-response items designed to test
“communication of science ideas in writing,” as discussed in Chapter 8 and Appendix R. A random
sample of fourth-grade students received Form E, the only form with constructed-response items
deemed appropriate for fourth graders. All students responding to Forms C to J also responded to the
10 common selected-response inquiry items.




The 10 inquiry items were included with the goal of potentially linking science scores across all form 
types. Over the course of the project, we determined that selected- and constructed-response items 
were measuring different skills and should not be combined. Two of the inquiry items were 
problematic (see next subsection), and three of the inquiry items tapped the life science content 
strand. This made the set inappropriate for linking Earth and space science and physical science 
scores across grades. 




TABLE D1. FORMS FOR THE STUDENT SCIENCE ASSESSMENT AND SURVEY


[Table body not recoverable from the source. Legible column headings: Form; Grade administered; Selected-response items (4th grade; 5th grade, 19 items; 4th and 5th grade inquiry, 10 items); Constructed-response items.]



ADMINISTRATION OF THE STUDENT SCIENCE ASSESSMENT 


In spring Year 2 (2017-18), teachers administered the student science assessment and survey to
their students. Students received computer access and a 60-minute period to complete the 
assessment. Per district requests, we emphasized to teachers that all required district and state testing 
must be prioritized over the Making Sense of SCIENCE assessment. We closely monitored teachers’ 
completion progress and sent periodic reminders. 


The assessment was administered on the Quest platform, an online testing system developed by the 
3-C Institute for Social Development. We selected the Quest platform because its design incorporates 
Universal Design principles, including accommodations such as text-to-speech. We wanted to include 
the text-to-speech feature in order to accommodate English learner students, students with reading 
disabilities, and/or students with limited literacy. Therefore, we provided class sets of headphones to
teachers who needed them for their students to use during testing. Additionally, we asked teachers
not to administer the science assessment and survey to students who take the alternative or modified
state assessments. For students who need testing accommodations (other than voice-over) per their
Individualized Education Programs, we asked teachers to use their discretion in deciding whether it
was feasible to test such students.


IDENTIFYING ITEMS TO BE REMOVED PRIOR TO IMPACT ANALYSIS 


After we collected the student science achievement assessment data, an initial analysis of the items 
led us to remove several items prior to further analysis. We removed 3 life science items from the 10 
inquiry items because Making Sense of SCIENCE did not address this content strand. We also 
removed one item because of an abnormally high level of non-response and one item because one of 
the incorrect response options was a strong distractor selected by many students. Both of these items 
led to instability of item calibration using IRT, so we removed them from the assessment for both 
grades. Analysis of Differential Item Functioning (DIF), conducted by an independent contractor, 
revealed none of the items to be problematic. Therefore, we removed no additional items. The final 
forms included 25 selected-response items in Grade 4 and 24 selected-response items in Grade 5. We 
used these items to estimate achievement. 
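The item counts reported above can be cross-checked with simple arithmetic. The sketch below is only a bookkeeping aid built from the counts stated in this appendix; the variable and function names are hypothetical. It assumes, as the text indicates, that all five removed items belonged to the 10 common inquiry items, so the same number is removed from each grade's form.

```python
# Selected-response item pool as described in the text: (strand, grade) -> number of items.
pool = {
    ("Earth and space science", 4): 10,
    ("Earth and space science", 5): 9,
    ("physical science", 4): 10,
    ("physical science", 5): 10,
    ("inquiry", "4 and 5"): 10,  # common to both grades
}
total_pool = sum(pool.values())


def form_length(grade):
    """Selected-response items on a grade's form: grade-specific items plus common inquiry items."""
    grade_specific = sum(n for (strand, g), n in pool.items() if g == grade)
    return grade_specific + pool[("inquiry", "4 and 5")]


# Items removed before impact analysis (assumed to all come from the common inquiry set).
removed = {"life science": 3, "high non-response": 1, "strong distractor": 1}
n_removed = sum(removed.values())

final_items = {grade: form_length(grade) - n_removed for grade in (4, 5)}
print(total_pool, form_length(4), form_length(5), final_items)
```

The totals agree with the text: a pool of 49 items, 30 and 29 selected-response items on the fourth- and fifth-grade forms, and 25 and 24 items retained for scoring.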


We calculated item parameter values and IRT-based scale scores using the full analytic sample used
to run confirmatory impact analyses (N = 2,140). The goal was to have the closest possible
correspondence between the samples used for score calibration and impact estimation. Tables D2
and D3 display select item statistics for the fourth- and fifth-grade science achievement tests.
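For readers unfamiliar with the Classical Test Theory columns in Tables D2 and D3, the sketch below shows one common way to compute proportion correct and the biserial correlation between an item and the total score. It is an illustration with made-up data, not the study's code. It assumes the biserial is obtained from the point-biserial through the usual normal-ordinate conversion and that the total score includes the item itself; the source does not specify either choice, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def proportion_correct(item):
    """Share of examinees answering the item correctly (item scored 0/1)."""
    return mean(item)


def point_biserial(item, total):
    """Point-biserial correlation between a 0/1 item and the total score."""
    p = mean(item)
    q = 1 - p
    mean_correct = mean([t for x, t in zip(item, total) if x == 1])
    mean_incorrect = mean([t for x, t in zip(item, total) if x == 0])
    return (mean_correct - mean_incorrect) / pstdev(total) * (p * q) ** 0.5


def biserial(item, total):
    """Biserial correlation via the normal-ordinate conversion of the point-biserial."""
    p = mean(item)
    q = 1 - p
    ordinate = NormalDist().pdf(NormalDist().inv_cdf(p))
    return point_biserial(item, total) * (p * q) ** 0.5 / ordinate


# Hypothetical data: eight students' scores on one item and their total test scores.
item = [1, 0, 1, 1, 0, 0, 1, 0]
total = [20, 18, 9, 15, 11, 8, 17, 12]

p_correct = proportion_correct(item)
r_pb = point_biserial(item, total)
r_b = biserial(item, total)
print(p_correct, round(r_pb, 3), round(r_b, 3))
```

The biserial is always larger in magnitude than the point-biserial for the same data, so the two statistics should not be compared directly across reports.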




TABLE D2. SUMMARY OF CLASSICAL TEST THEORY AND ITEM-RESPONSE THEORY PARAMETERS FOR 4TH GRADE 


Factor 1 Factor 2
loadings loadings
(Oblique (Oblique Percent Biserial Difficulty Difficulty Discrimination Difficulty Discrimination
rotation) rotation) correct correlation (1PL) (2PL) (2PL) (3PL) (3PL)
0.476 0.103 Oo2 270 -1.076 -0.611 1.453 -0.298 1.749 Eno NECAP 
0.450 -0.040 439 366 0.419 0.266 iis 0,552 1.462 PS NAEP 
0.435 -0.039 443 364 0.388 0.256 1.066 0.653 1.738 ee NAEP 
0.426 0.082 041 00 -0.284 -0.196 1,091 O11 L275 PS NECAP 
0.415 -0.010 070 343 0.760 0.532 1.001 0.806 1.296 EoD NECAP 
0.415 -0.014 388 2000) 0773 O52 1.025 0.861 1:63) Ess MCAS 
0.401 0.166 .600 oz -0.696 -0.492 1.020 -0.106 1.219 Eso MCAS 
0.395 0.118 AZT wee -1.669 -1.063 1.169 -0.753 1.288 ESS NAEP 
0.381 -0.109 21 le 1.272 0.974 0.871 T2590 1.540 ESS NECAP 
0.361 -0.091 410 200 0.620 O12 0.805 0.866 1031 ro invented 
0.336 -0.140 366 aa fe 0.932 0.844 0.720 1221 1.003 ESS NAEP 
0.300 0.150 2 241 -0.499 -0.464 0.702 -0.044 0.760 Fo MCAS 
0.275 -0.069 334 243 1174 L135 0.650 1.431 1.322 ro MOSART-r 
0.281 0.260 464 246 0.244 0.243 0.639 0.774 C773 ee AAAS 
0.277 -0.011 Re P46) 221 1.246 1.256 0.633 1.556 0.899 EDS MCAS 
0,252 0.170 418 wee 0.562 0.661 Oo | 1.224 0.701 Eso NAEP 
0,235 -0.001 263 203 1.747 1.948 0.564 1.767 L315 ESS MOSART 
0.148 -0.113 305 109 1.398 2.564 329 1.880 2.179 ESS NAEP 
0.130 0.066 394 106 0.730 1.604 O272 2.616 0.412 PS NECAP 
0.101 0.005 449 .084 0.346 1.024 0.201 2622 0.344 PS NAEP 
W097 -0.014 22] .080 2.076 D.do7 O.232 4.539 0.545 ESS NAEP 


22 0.081 -0.019 oie .080 1.677 5.704 0.174 2.309 2.147 ESS MOSART-r 
23 0.086 0.328 615 077 -0.804 -2.121 0.224 -0.353 0.267 ro AAAS 
24> 0.136 -0.178 389 .098 0.768 1.692 0.272 2.570 0.725 PS NAEP 
25 0.134 -0.168 326 .105 les4 2./84 0.265 2.505 1.122 PS MCAS 
Proportion variance explained ᵃ 2.411 0.411
Inter-factor correlations .007
Reference axis correlations .007


Note. 1/2/3 PL = 1/2/3-Parameter Logistic; NECAP = The New England Common Assessment Program; NAEP = National Assessment of Educational Progress; MOSART =
Misconception-Oriented Standards-Based Assessment Resources for Teachers; MCAS = Massachusetts Comprehensive Assessment System; AAAS = American Association for the
Advancement of Science; ESS = Earth and space science; PS = physical science


ᵃ Eliminating other factors.

ᵇ [Footnote text illegible in the source.]




TABLE D3. SUMMARY OF CLASSICAL TEST THEORY AND ITEM-RESPONSE THEORY PARAMETERS FOR 5TH GRADE 


Factor 1 Factor 2
loadings loadings
(Oblique (Oblique Percent Biserial Difficulty Difficulty Discrimination Difficulty Discrimination
rotation) rotation) correct correlation (1PL) (2PL) (2PL) (3PL) (3PL)

1 0.434 -0.021 ott 296 1,079 0.557 1.113 0.81 1.445 ro MOSART 
0.412 -0.007 eer) 316 1.009 0.724 1.009 1.031 1.718 PS NECAP 
0.396 0.056 483 307 0.149 0.081 O95 0.564 1.438 ESS MOSART 
0.386 -0.009 A491 wag | 0.074 0.041 0.912 0.462 1.201 ESS MCAS 
0.320 O21 odes) 246 1474 1.134 0.769 1.445 0.999 ee MOSART 
0.284 0.138 209 217 1.934 1.473 0.669 1.607 1.402 ESS MOSART 
0.283 -0.187 .66 189 -1.428 -1.116 0.651 -0.592 0719 PS NECAP 
0.392 -0.190 FCO 209 -2.779 -1.306 1,292 -0.974 1.556 ESS NAEP 
O22) 0.144 al, 155 2.734 2.466 0550 2.268 1LA%6 ro AAAS 
0.207 O.17S 326 161 1.568 1.632 0.469 1.818 1.403 PS MOSART 
0.349 -0.054 403 247 0.842 0.573 0.766 0.973 1.008 Eso NAEP 
0.180 OOT9 202 a8 1.8 2.141 0.405 2.501 0.689 ro MOSART 
0.145 -0.100 336 1 1.464 2.426 0.286 2.934 0.543 Eso MCAS 
OSS 0.006 A05 1Q2 0.825 1.369 0.285 2.635 0.399 Eso MOSART 
0.123 -0.011 389 087 O96/ 1.673 0.274 2.661 0.426 ro MOSART 
0.298 -0.107 416 207 0.727 0.569 Recs 107 O93 bod NAEP 
Q,059 0.009 309 044 1./32 6.299 0.098 6.784 ee? PS MOSART 
OS2 0.264 265 147 a re Z.00 0.366 2.078 1.719 Eos NAEP 
-0.048 0,235 236 -.005 2.021 -11.005 -0.107 3.073 LOSS ro NECAP 
0.051 -0.083 425 .018 0.65 Sudo 0.09 S015 0.206 PS NAEP 
0.114 0.227 418 le OF 12 1.326 0.253 2.016 1527 Eos NAEP 


22 0.086 0.213 302 Abbe: 1.807 4.516 0.187 2./01 1.514 ESS MOSART 
23 0.103 -0.130 oo 057 1.478 3.200 0.195 3.017 0.468 Ess MCAS 
245 0.181 0.051 a76 147 0.911 1.168 0.374 1752 0.956 ee NAEP 
Proportion 
variance 1.570 0.433 
explained 2 
GR tase 030 
correlations 
Reference 
axis -.030 


correlations 


Note. 1/2/3 PL = 1/2/3-Parameter Logistic; NECAP = The New England Common Assessment Program; NAEP = National Assessment of Educational Progress; MOSART = 
Misconception-Oriented Standards-Based Assessment Resources for Teachers; MCAS = Massachusetts Comprehensive Assessment System; ESS = Earth and space science; PS = 


physical science 


a Eliminating other factors


Micelaamuv-\oml are [Ulol-Yo ml am oXoliaMrol0|auaire|e-\o(-M-lalemilivanre|s-le\-Mrelsaays 




CHOOSING AN APPROACH TO SCALING 


Under the guidance of a psychometrician, we examined the characteristics of the assessment. It was 
notably difficult based on examination of item percent-correct scores. In Tables D4 and D5, we 
display averages of percent-correct scores by decile of ELA and math third-grade pretest scores. 
Figures D1 and D2 show the average percent-correct scores on the science achievement assessment 
(by treatment and control) across deciles of the ELA pretest and math pretest distributions. We 
observe that proportions correct are low, with percent-correct scores below 50% across most of the 
pretest distributions. 
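
The decile summaries reported in Tables D4 and D5 can be illustrated with a minimal Python sketch. It assumes a rank-based assignment of students to ten equal-sized groups on the pretest. The function name and the toy data are hypothetical, and the report does not state how ties or uneven group sizes were handled.

```python
from statistics import mean


def mean_by_decile(pretest, pct_correct):
    """Average percent-correct score within each decile of the pretest.

    Students are ranked on the pretest and split into ten groups of
    (nearly) equal size. Ties are broken by input order (an assumption).
    """
    order = sorted(range(len(pretest)), key=lambda i: pretest[i])
    n = len(order)
    groups = [[] for _ in range(10)]
    for rank, i in enumerate(order):
        groups[rank * 10 // n].append(pct_correct[i])
    # A decile with no students (possible when n < 10) is reported as None.
    return [mean(g) if g else None for g in groups]


# Hypothetical data: 20 students, the lower half scoring 0.2, the upper half 0.6.
toy_pretest = list(range(20))
toy_pct_correct = [0.2] * 10 + [0.6] * 10
decile_means = mean_by_decile(toy_pretest, toy_pct_correct)
print(decile_means)
```

In this toy example each decile holds two students, so the first five decile means equal the lower score and the last five equal the higher score.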


TABLE D4. MEAN PERCENT CORRECT SCORES ON THE STUDENT SCIENCE ACHIEVEMENT 
ASSESSMENT BY DECILE OF ELA PRETEST ACHIEVEMENT (AVERAGED ACROSS GRADES 4 AND 5) 


Mean Std dev Minimum Maximum 

0.30 0.12 0.00 0.80 
214 0.32 0.11 0.04 0.64 
214 0.32 0.12 0.04 0.72 
214 0.35 Og 0.04 0.76 
214 0.40 0.13 0.08 0.80 
214 0.41 0.13 0.16 0.84 
214 0.44 0.14 ots 0.84 
214 0.46 0.13 0.16 0.84 
214 0.51 0.15 0.17 0.88 
C37 0.14 0.24 0.88 


TABLE D5. MEAN PERCENT CORRECT SCORES ON THE STUDENT SCIENCE ACHIEVEMENT 
ASSESSMENT BY DECILE OF MATH PRETEST ACHIEVEMENT (AVERAGED ACROSS GRADES 4 AND 5) 


Mean Std dev Minimum Maximum 
0.30 0.12 0.00 0.80 
0.32 0.11 0.04 0.64 
0.32 0.12 0.04 0.72 
0.29 0.11 0.04 0.80 
0.31 0.12 0.04 0.80 
0.34 0.13 0.00 0.72 
0.37 0.12 0.12 0.79 
0.39 0.14 0.08 0.76 
0.41 0.14 0.08 0.84 
0.43 0.12 0.17 0.76 




[Plot: percent correct (axis ticks 0.3 to 0.5) against deciles of ELA pretest, with points labeled C and T]


FIGURE D1. DIFFERENCES IN PERCENT CORRECT ON THE STUDENT SCIENCE ACHIEVEMENT 
ASSESSMENT BETWEEN CONDITIONS BY DECILE OF ELA PRETEST 


Note. C is control; T is treatment 


[Plot: percent correct against deciles of math pretest]


FIGURE D2. DIFFERENCES IN PERCENT CORRECT ON THE STUDENT SCIENCE ACHIEVEMENT 
ASSESSMENT BETWEEN CONDITIONS BY DECILE OF MATH PRETEST 


Note. C is control; T is treatment 



Advisors debated the best approach to scaling. One view was that a 3-Parameter Logistic (3PL) model 
was justified given the apparent level of student guessing. Others objected that students would not 
guess outright, even on difficult items, and that at least some of the less proficient students would 
likely use strategies to, for instance, narrow the number of response options. 


Facing a complex choice and recognizing that there were competing rationales for the different IRT 
models, we opted to analyze impacts with four approaches to scaling: percent-correct and three 
standard IRT-based models (1PL, 2PL, and 3PL).³ Our reasoning was that if impacts were robust to the 
choice of scaling, we could have more confidence in the result. We also viewed this as an opportunity 
to investigate an interesting question: whether different approaches to scaling would lead to similar 
results that support the same conclusion about program impact. The Test Characteristic and Test 
Information Curves for the three IRT models (displayed in Figures D3, D4, and D5) show different 
patterns, with the 3PL model yielding minimal information at the low end of the achievement scale. 
This is not surprising given that students at the low end of the scale responded correctly at a rate 
only slightly above the guessing rate. It suggested that the scale would not discriminate among 
ability levels in that range, potentially limiting the reliability of individual scores as well as the 
precision of average achievement scores and impact estimates in that interval. On the other hand, the 
correlations among the scores from the four approaches to scaling were high (see Figures D6 and D7), 
which suggested that the choice of scaling might make little difference to average scores and 
estimates of impact. Thus, it was not clear what effect the different approaches to scaling would have 
on student science achievement scores and on the estimated impact on that outcome. (We show the 
results for fourth grade; very similar results were obtained for fifth grade.) 
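
To make the distinction among the three IRT models concrete, the following minimal Python sketch evaluates the standard 1PL, 2PL, and 3PL item response functions for a single item. The function name and the parameter values are illustrative assumptions, not estimates from this study, and the sketch is not the calibration procedure used in the analysis.

```python
import math


def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: student ability; b: item difficulty;
    a: discrimination (held at 1 for the 1PL);
    c: lower asymptote, or guessing parameter (held at 0 for the 1PL and 2PL).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: difficulty 1.0, discrimination 1.5, guessing 0.25.
for theta in (-3.0, 0.0, 1.0, 3.0):
    p1 = irf(theta, b=1.0)                  # 1PL
    p2 = irf(theta, b=1.0, a=1.5)           # 2PL
    p3 = irf(theta, b=1.0, a=1.5, c=0.25)   # 3PL
    print(f"theta={theta:+.1f}  1PL={p1:.3f}  2PL={p2:.3f}  3PL={p3:.3f}")
```

Under the 3PL, the probability of a correct response flattens toward the guessing parameter as ability decreases, so responses from low-ability students carry little information about their ability. This is consistent with the minimal information at the low end of the scale noted above.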


[Panels: Group 1, Test Characteristic Curve; Group 1, Total Information Curve]


FIGURE D3. TEST CHARACTERISTIC CURVE AND TEST INFORMATION CURVE FOR 1PL MODEL IN 
FOURTH GRADE (N = 1,220) 


3 We used the software IRTPRO (Cai, Thissen, & du Toit, 2011). We conducted separate score calibrations in Grade 4 and 
Grade 5. We used the Bock-Aitkin EM algorithm in IRTPRO to obtain item parameter and student score estimates. 


[Panels: Group 1, Test Characteristic Curve (expected score by theta); Group 1, Total Information Curve (total information and standard error by theta)]


FIGURE D4. TEST CHARACTERISTIC CURVE AND TEST INFORMATION CURVE FOR 2PL MODEL IN 
FOURTH GRADE (N = 1,220) 


[Panels: Group 1, Test Characteristic Curve (expected score by theta); Group 1, Total Information Curve (total information and standard error by theta)]


FIGURE D5. TEST CHARACTERISTIC CURVE AND TEST INFORMATION CURVE FOR 3PL MODEL IN 
FOURTH GRADE (N = 1,220) 


[Scatterplot matrix of scores under the four scaling approaches; visible panel labels are Percent, theta2param, and theta3param. Correlations shown: R = 0.96 and R = 0.96 in the theta2param row; R = 0.95, R = 0.95, and R = 0.99 in the theta3param row.]


FIGURE D6. CORRELATIONS AMONG SCORES ON THE STUDENT SCIENCE ACHIEVEMENT 
ASSESSMENT CALIBRATED USING DIFFERENT APPROACHES 
(N = 1,220) 


As shown in Appendix M, among our sensitivity analyses for the confirmatory test of impact on 
student science achievement, we examined results using 24 impact models: 3 covariate sets (no 
covariates, pretest as the only covariate, and with a full set of covariates) x 2 ways of modeling 
randomized blocks (fixed or random) x 4 calibration methods (percent-correct and 1PL, 2PL, and 3PL 
scaling). All impact models included random effects for schools (the unit of random assignment). 
Then using Type-3 tests of fixed effects, we examined whether, for the 24 approaches, impacts varied 
depending on the three main criteria informing the impact model: the approach to scaling, the 
covariates used in analysis, and whether randomized blocks were modeled as fixed or random. 
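
As a quick check on the count, the 24 impact models can be enumerated as the full crossing of the three criteria. This Python sketch only lists the specifications. The variable names are hypothetical, and it does not fit any model.

```python
from itertools import product

# Labels follow the description in the text; variable names are illustrative.
covariate_sets = ["no covariates", "pretest only", "full set of covariates"]
block_models = ["fixed blocks", "random blocks"]
scalings = ["percent-correct", "1PL", "2PL", "3PL"]

# Full crossing of the three criteria: 3 x 2 x 4 specifications.
specs = list(product(covariate_sets, block_models, scalings))

print(len(specs))
for covariates, blocks, scaling in specs[:3]:
    print(covariates, "|", blocks, "|", scaling)
```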


None of the impact estimates reached statistical significance (all p values were greater than .30). The 
Type-3 tests of fixed effects revealed that among the 24 combinations of approaches to modeling 
impact, estimates did not vary beyond chance depending on scaling (p = .996), but they did vary 
depending on which covariates were used (p < .001), and depending on whether the randomized 
blocks were modeled as fixed or random (p < .001). A notable result is that impact estimates ranged in 
values between -.028 and .081 standard deviations, which is a substantial difference considering that 
impacts as small as .05 standard deviations are considered substantively important (Bloom et al., 
2008). 




Appendix E. All Student Survey Scales Measuring Opportunity to Learn and Non- 
Academic Outcomes 


This appendix provides the student survey scales that were administered to students to measure 
students’ opportunity to learn and non-academic outcomes in spring of Year 2 (2017-18). The set of 
survey scales consisted of six scales modified from the Friday Institute for Educational Innovation, 
TIMSS 2015 Questionnaire, and the Colorado Education Initiative. Modifications include the addition 
or removal of items, and modifications to the answer scales. We also created two survey scales to 
measure cognitive demand and agency in learning. 


MUA 


Items with an “*” were reverse coded before analysis. 
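
Reverse coding on a 5-point scale can be sketched as follows. This minimal Python example assumes responses are stored as integers from 1 to 5, which the report does not state, and the function name is hypothetical.

```python
def reverse_code(response, scale_min=1, scale_max=5):
    """Flip a response so that high agreement becomes low and vice versa."""
    if not scale_min <= response <= scale_max:
        raise ValueError("response is outside the scale range")
    return scale_max + scale_min - response


# On a 1-5 scale, 1 maps to 5, 2 to 4, 3 stays 3, and so on.
print([reverse_code(r) for r in (1, 2, 3, 4, 5)])
```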


ITEM SET 1 (FORMS A1 AND B1) 


Aspirations 
To what extent do you agree or disagree with the following statements about science (5-point scale: 
Strongly disagree, Disagree, Neither disagree or agree, Agree, Strongly agree) 


a) I expect to use science when I am an adult. 

b) Knowing science will help me get a job. 

c) I would consider having a job in science. 

d) Knowing science will help me in my work when I am an adult. 


Source: Friday Institute for Educational Innovation (2012). Elementary School STEM - Student Survey. 
Raleigh, NC: Author. 


Quality of Science Class — Learning Environment/Classroom Management 


How often do the following things happen in your science class? (5-point scale: Almost never, Once 
in a while, Sometimes, Frequently, Almost always) 


a) Our class stays busy and does not waste time when doing science. 

b) Students in my class are respectful to our teacher during science class. 

c) Students in my class behave the way my teacher wants them to during science class. 

d) Students in my class know what they are supposed to be doing and learning in science. 
e) Students in my class listen to each other when someone is sharing their ideas about science. 
f) Students like raising their hands and asking questions in science. 

g) The behavior of other students in my science class helps my learning of science. 

h) Students share their science ideas in class. 

i) The teacher respects students’ science ideas in my class. 

j) Rules are used in our science class to make sure everyone is treated fairly. 

k) The teacher trusts students to take care of science materials. 

l) Our teacher treats us fairly. 




Source: Colorado Education Initiative (2013). Colorado’s Student Perception Survey. Denver, CO: 
Author. 


ITEM SET 2 (FORMS A2 AND B2) 


Self-Efficacy 
To what extent do you agree or disagree with the following statements about science (5-point scale: 
Strongly disagree, Disagree, Neither disagree or agree, Agree, Strongly agree) 


a) I usually do well in science. 
b) Science is harder for me than for many of my classmates.* 
c) I am just not good at science.* 
d) I learn things quickly in science. 
e) My teacher tells me I am good at science. 
f) Science is harder for me than any other subject.* 
g) Science makes me confused.* 


Source: TIMSS 2015 Student Questionnaire. Copyright © 2014 International Association for the 
Evaluation of Educational Achievement (IEA). 


Publisher: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. 


Activities in Science Classroom 
How often do you do these things in your class when you are learning science? (5-point scale: 
Never, Almost Never, Sometimes, Very Often, Always) 


a) I watch the teacher do a science experiment. 

b) I plan or do a science experiment or project on my own. 

c) I work with other students in a small group on a science experiment or project. 
d) I read about science. 

e) I write an explanation for something I am studying in science. 

f) I discuss with other students the things I am studying in science. 

g) I discuss with my science teacher the things I am studying in science. 


Source: TIMSS 2007 Student Questionnaire. Copyright © 2007 International Association for the 
Evaluation of Educational Achievement (IEA). 


Publisher: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College 




ITEM SET 3 (FORMS A3 AND B3) 


Quality of Science Class — Science Instruction 


To what extent do you agree or disagree with the following statements about your class when you 
are learning science? (5-point scale: Strongly disagree, Disagree, Neither disagree or agree, Agree, 
Strongly agree) 


a) My teacher plans interesting things for us to do. 

b) My teacher makes us think. 

c) My teacher wants us to talk about what we think. 

d) My teacher asks us to write down what we do, think, and observe. 
e) My teacher thinks we can learn challenging science. 

f) My teacher tells us it is okay to be wrong sometimes in science. 

g) My teacher asks interesting questions. 


Source: TIMSS 2015 Student Questionnaire. Copyright © 2014 International Association for the 
Evaluation of Educational Achievement (IEA). 


Publisher: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. 


Agency in Learning 
How often do the following things happen in your class when you are learning science? (5-point 
scale: Almost never, Once in a while, Sometimes, Frequently, Almost always) 


a) The teacher asks me to share my ideas in science. 

b) I have choices about what I learn in science. 

c) The teacher tells us what to do in science class.* 

d) Students get to figure things out in my science class. 

e) The teacher does most of the explaining in my science class.* 
f) The teacher asks students to lead science activities. 


ITEM SET 4 (FORMS A4 AND B4) 


Cognitive Demand 


How often do the following things happen in your class when you are learning science? (5-point 
scale: Almost never, Once in a while, Sometimes, Frequently, Almost always) 


a) I learn challenging things in science class. 

b) I have to think hard to figure things out in science class. 
c) The teacher asks me to explain my ideas in science class. 
d) The teacher encourages me to work hard in science class. 
e) The teacher has high expectations for me in science class. 




Enjoyment of Science 


To what extent do you agree or disagree with the following statements about learning science (5- 
point scale: Strongly disagree, Disagree, Neither disagree or agree, Agree, Strongly agree) 


a) I enjoy learning science. 

b) I wish I did not have to study science.* 

c) Science is boring.* 

d) I learn many interesting things in science. 

e) I like science. 

f) I look forward to learning science in school. 

g) Science teaches me how things in the world work. 
h) I like to do science experiments. 

i) Science is one of my favorite subjects. 


Source: TIMSS 2015 Questionnaire. Copyright © 2015 International Association for the Evaluation of 
Educational Achievement (IEA). 


Publisher: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. 




Appendix F. Description of the Principal Surveys 


This appendix provides a description of the principal surveys, which were administered in spring 
2016-17 and spring 2017-18. Heller Research Associates (HRA) analyzed and reported the survey 
responses (Wong et al., 2020). 


The purpose of the administrator survey (intended for principals or vice principals) was to capture 
information about each school and its leadership, particularly in relation to science instruction at 
baseline and throughout the course of the study. The surveys covered a range of topic areas, 
including: 


• how science instruction is prioritized compared to other subjects at the school, barriers and 
supports for science instruction, and resources available for science instruction; 

• philosophy about and confidence in teaching and learning science and attitude toward 
change; 

• perceived influence in and capacity to support teachers in a number of areas such as 
presenting opportunities for professional learning, supporting collaboration, and giving 
instructional feedback; 

• familiarity with and attitudes toward NGSS; 

• teacher and administrator turnover rates, school climate and the dynamics among 
administrators and teachers at their school, and the culture of collaboration; 

• professional learning implemented at the school; 

• education and teaching background including years of experience teaching and in school 
leadership positions; and 

• demographic information such as race/ethnicity and gender (on baseline survey only). 


For each survey, either the principal or the vice principal (but not both) completed the survey. 
Administrators who joined the study after randomization did not receive the baseline survey, but did 
answer a subset of questions, including demographic and teaching background information. 




Appendix G. Description of the Pilot of the Video and Audio Recordings 


This appendix documents the pilot of the classroom video recordings conducted in spring of Year 1 
(2016-17) and the audio recordings collected in spring of Year 2 (2017-18). 


PILOT OF CLASSROOM VIDEO RECORDINGS 


During spring 2016-17, researchers piloted a process to collect data on classroom instructional 
practices and students’ discourse patterns through video recorded classroom observations. The pilot 
process included obtaining active and passive parental consent, training local camera operators to set 
up classroom sets of video/audio equipment, scheduling the observations, and collecting the data 
from a subset of study teachers. The purpose of the pilot was to estimate parental consent response 
rates and determine the feasibility of scheduling for the full sample of schools. The pilot also allowed 
researchers to test if the type and quality of the video and audio captured would be sufficient for use 
with the classroom observation scoring protocol, which was also in development. 


Parental Consent Process 


The parental consent process was piloted in 21 schools (9 schools in districts that required active 
parental consent and 12 schools in districts that required passive parental consent). In the districts 
that required active parental consent, approximately 35% (8/23) of teachers had a somewhat 
acceptable number of students (more than 10) who agreed to be video recorded. In the districts that 
required passive parental consent, 88% (23/26) of teachers had an acceptable number of students who 
could be video recorded. However, several teachers expressed that it would be too burdensome to 
remove the students who were not allowed to be recorded from class on the day of the observations. 
Teachers reported that they did not want these students to miss the lesson and did not have a central 
place for the students to go during this time. Logistically, this was a challenge for teachers and 
researchers. 


Scheduling and Set Up 


The study team piloted the scheduling and data collection process with 15 teachers from two districts 
in California that required passive parental consent. Researchers sent teachers the following 
instructions regarding recording. 


• We intend to record your classroom when science instruction is taking place. This means that 
dates and times during which students are taking tests, watching movies, etc. should not be 
included as potential video observation sessions. 


• We intend to record your classroom when the teacher who is participating in the Making 
Sense of SCIENCE study is teaching, not a teaching assistant or instructional specialist. 


• We would prefer to record an earth science or physical science lesson, but recognize that this 
may not be possible given your existing plans for science instruction. 


• We would like to see how teachers support students’ science dialogue. 




• There will be two sets of cameras at each school during the week they are being recorded. This 
means that two teachers can be recorded on the same day at the same time. There will be a 
camera operator who will come to your classroom at the time of the scheduled observation to 
set up and take down equipment. However, this person will not stay in the classroom during 
the recording. 


Ideally, we'd like to record your class for an entire lesson arc for a scenario where the class is 
introduced to something, do or observe an investigation/lab/or demo, and then talk about it. At a 
minimum, we'd like to record you for two consecutive lessons for a total of 60 minutes of science 
instruction. For example: 


• If you teach a 90-minute lesson, we'd record one lesson. 
• If you teach 45-minute or 60-minute lessons, we'd record two consecutive lessons. 
• If you teach 20-minute lessons, we'd record three consecutive lessons. 


We understand that your science instruction may not fit into these three examples, so please let us 
know if you have a different structure, and we will work with you to figure out what is best. If 
possible, we would prefer to finish recordings at a school within one school week. 


We hired local camera operators and trained them to set up the Swivl units, iPads, and 
microphones. In each classroom, one Swivl rig tracked the movement of the teacher’s microphone 
(attached to a lanyard around their neck). We set up the second Swivl rig in the back of the classroom 
to capture the board/projector, and placed the other microphones on the right or left side of the 
classroom, out of reach of students. Camera operators set up and removed the equipment but were 
not present in the classroom during the recording. Camera operators also collected, if available, a pre- 
observation form, lesson plans, and photos of student artifacts at the end of each recording. We 
uploaded all data to a central repository for viewing and coding. 


Results of Video Pilot 


The pilot produced approximately 19 hours of video from 12 teachers (3 teachers were unable to be 
recorded for various reasons after scheduling). Given the issues with consent response rates and the 
resource intensive process, the study team decided not to collect video from the full set of teachers in 
year 2 of the study. 


Summary of the Audio Study 


Researchers continued to investigate classroom instructional practices through a modified data 
collection plan that did not include video, or the need for scheduling data collection during specific 
times. In spring 2018, the Making Sense of SCIENCE research team collected classroom data through 
audio recordings, which were supplemented by a survey, teacher self-recorded interview, artifacts, 
and photos of instructional materials and classroom activities. The purpose of the audio recordings 
and interviews was to capture NGSS-aligned instructional practices and decisions and to determine if 
Making Sense of SCIENCE impacted the enacted instruction. 


Study teachers in six of the seven districts were invited to participate in this data collection. Of the 
105 teachers who were invited to participate, 26 agreed (15 in control schools and 11 from Making 
Sense of SCIENCE schools). Of those, 19 teachers completed the data collection activity (9 control and 
10 Making Sense of SCIENCE). The remaining seven teachers reported scheduling issues and were 
unable to complete the activity. The research team mailed “audio recording kits” to the teachers, 
which included parental consent forms, an audio recorder, a disposable camera, an information 
survey about the recorded lesson, and instructions. 


Teachers were given the following guidelines for deciding which lesson(s) to record. 
• Plan to record 90 minutes of science instruction in one or more consecutive lessons. 


• Pick a lesson that shows how you include next generation science learning (NGSS) in your 
classroom. 


• Focus on Earth science or physical science topics throughout the recorded lessons, if 
possible. If it is not possible, other science topics are acceptable. 


• Do not select times when students are primarily taking tests, watching movies, doing non- 
science work, etc. 


• It is not necessary to create a lesson solely for the purpose of this recording. 


Teachers wore the USB audio recorder on a lanyard around their neck during the recorded lessons. 
They turned the recorder off when speaking to students who did not have parental consent to be 
recorded. Additionally, we asked teachers to provide photocopies of lesson plans, notes, handouts, or 
materials used by students during the lesson(s), and slides or overheads projected for students. They 
also used the disposable camera to take photos of student or teacher writings or drawings done on 
the board during the lesson(s), as well as any posted instructions, diagrams, and guidelines referred 
to during the lesson. 


After the lesson, we asked teachers to complete a Classroom Information Survey about their class and 
the recorded lesson(s). We also asked them to record a post-lesson reflection interview in response to 
the questions on a teacher interview protocol asking them to reflect on the lesson (what they did, 
what was effective, how they would modify the lesson in the future) and ways in which the lesson 
included aspects of NGSS. Once they completed their audio recording study, they mailed back their 
completed audio kit to researchers for analysis. HRA analyzed the data from the Classroom 
Information Survey, which focused on content, and teachers’ attitude and beliefs before, during, and 
after the lesson (Wong et al., 2020). They state that a secondary analysis of audio-recorded classes and 
teacher interviews would offer more insight into the conceptual orientation of Making Sense of 
SCIENCE versus control classrooms, as well as the nature of student group discussions. 




Appendix H. Hierarchical Linear Model for the Analysis of Impact on Teacher Content 
Knowledge 


This appendix presents the hierarchical linear model used to evaluate impact on teacher content 
knowledge (Equation H1). 


$$y_{jk} = \beta_0 + \beta_1 \mathit{Treatment}_k + \sum_{q=1}^{Q} \lambda_q X_{kq} + \sum_{r=1}^{R} \gamma_r Z_{jkr} + \sum_{s=1}^{S} \delta_s \mathit{BLOCK}_{ks} + \epsilon_k + \epsilon_{jk} \qquad \text{(H1)}$$


$y_{jk}$ is the outcome for teacher $j$ in school $k$. $\mathit{Treatment}_k$ is a binary variable at the school level, with 0 
indicating assignment to control and 1 indicating assignment to Making Sense of SCIENCE. The effect 
of the intervention is assessed in terms of the statistical significance of the estimate of $\beta_1$. The model 
includes effects of covariates at the school level ($\lambda_q$) and at the teacher level ($\gamma_r$), as well as fixed 
effects for randomized blocks ($\delta_s$; we assume $S$ blocks, with $\mathit{BLOCK}_{ks}$ taking on the value 1 if school $k$ is in 
block $s$ and 0 otherwise). $\epsilon_k$ and $\epsilon_{jk}$ represent school- and teacher-level random effects. 
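
The structure of Equation H1 can be illustrated by simulating data from a simplified version of the model. The Python sketch below omits covariates and assumes that each randomized block is a pair of schools with one school assigned to treatment. All parameter values, the function name, and the block structure are hypothetical and are not taken from the study.

```python
import random


def simulate_teacher_outcomes(n_blocks=4, teachers_per_school=5,
                              beta0=0.0, beta1=0.3, sd_block=0.3,
                              sd_school=0.2, sd_teacher=0.5, seed=1):
    """Simulate teacher outcomes from a simplified H1 without covariates."""
    rng = random.Random(seed)
    rows = []
    for block in range(n_blocks):
        block_effect = rng.gauss(0.0, sd_block)        # block effect
        assignments = [0, 1]
        rng.shuffle(assignments)                       # randomize within the pair
        for treatment in assignments:
            school_effect = rng.gauss(0.0, sd_school)  # school-level random effect
            for _ in range(teachers_per_school):
                residual = rng.gauss(0.0, sd_teacher)  # teacher-level random effect
                y = beta0 + beta1 * treatment + block_effect + school_effect + residual
                rows.append({"block": block, "treatment": treatment, "y": y})
    return rows


rows = simulate_teacher_outcomes()
treated = [r["y"] for r in rows if r["treatment"] == 1]
control = [r["y"] for r in rows if r["treatment"] == 0]
# A naive difference in means; the report instead estimates the treatment
# coefficient within the hierarchical model.
print(sum(treated) / len(treated) - sum(control) / len(control))
```

Because teachers in the same school share the school-level effect, their outcomes are correlated, which is why the model includes a school-level random effect rather than treating teachers as independent.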




Appendix I. Detailed Results of the Benchmark Analysis of Impacts on Teacher 
Content Knowledge 


This appendix presents the detailed results of the benchmark analysis on teacher content knowledge 
for the Mixed and Retained in Study samples. 


RESULTS OF THE BENCHMARK ANALYSIS OF IMPACTS ON TEACHER CONTENT KNOWLEDGE FOR THE MIXED 
SAMPLE (N = 118) 


TABLE I1. ESTIMATES OF FIXED EFFECTS —- MIXED SAMPLE 


Fixed effects Coefficient | Standard error df t-ratio p value 
Intercept -2.958 1.058 27 -2.79 010 
Treatment status 0.171 0.134 va 1.43 Rifle 
Content knowledge pretest 3.844 0.522 a9 7.00 <.0001 
Ethnicity is White -0.084 0.334 oY -0.25 802 
Ethnicity is Hispanic -0.159 0.321 ay -0.49 624 
Ethnicity is Black -0.498 0.351 oF -1.42 164 
Ethnicity is Unknown 0.266 0.325 oF 0.82 A418 
Ethnicity is Mixed 0.046 0.445 oF 0.10 918 
Ethnicity is Native American (reference category) 0.000 
Missing ethnicity -0.628 0.647 37 -0.97 00 
Teacher gender is female -0.029 0.196 39 -0.15 884 
Teacher gender is male (reference category) 0.000 
Missing gender 1.497 1.050 ag 1.43 162 
Certificate in Early Childhood Ed. -0.207 0.255 a? -0.81 422 
Certified in Eng. Language Dev. -0.070 0.174 39 -0.41 687 
Has Higher Ed. Degree -0.143 0.113 39 -1.27 212 
Years of teaching experience 0.005 0.009 39 0.63 335 
Missing years teaching 0.014 0.199 oy 0.07 944 
Taught science previous year 0.233 0.206 oF 1.13 .265 
Indicates little time for science instruction 0.088 0.217 39 0.40 .688 
Pretest scale: School context teacher / admin culture -0.220 0.208 39 -1.06 298 
Pretest scale: School context teacher culture 0.090 0.173 39 0.52 605 
Pretest scale: Use of NGSS activities -0.047 0.041 3? -1.14 woe 
Pretest scale: confidence with science content -0.018 0.101 39 -0.18 857 
Pretest scale: confidence in science instruction 0.200 0.137 39 1.46 ASZ 
Pretest scale: confidence with literacy and discourse 0.002 0.147 39 0.01 990 
Pretest scale: Perceived level of influence 0.112 0.097 39 1.15 256 




TABLE I1. ESTIMATES OF FIXED EFFECTS — MIXED SAMPLE 


Fixed effects Coefficient | Standard error df t-ratio p value 
Pretest Scale: Beliefs about teaching 0.136 0.148 39 0.92 361 
Pretest Scale: Has mentor available at school -0.162 0.244 39 -0.66 le 


Note. We do not include estimates for pair fixed effects in the table. 


TABLE I2. ESTIMATES OF LEVEL-1 AND LEVEL-2 VARIANCE COMPONENTS (RANDOM EFFECTS) - MIXED 
SAMPLE 


Random effect | Variance component | Standard error | Z value | p value 
School 047 0.113 0.42 wat 
Residual (teacher) 07 0.107 4.70 < .001 


RESULTS OF THE BENCHMARK ANALYSIS OF IMPACTS ON TEACHER CONTENT KNOWLEDGE FOR THE 
RETAINED IN STUDY SAMPLE (N = 88) 


TABLE I3. ESTIMATES OF FIXED EFFECTS - RETAINED IN STUDY SAMPLE 


Fixed effects Coefficient Standard error df t-ratio p value 
Intercept -0.061 0.790 Z1 -0.08 737 
Treatment status 0.483 0.158 21 307 006 
Content knowledge pretest 3.954 0.544 16 reer <.0001 
Ethnicity is White 0.548 0.366 16 1,50 153 
Ethnicity is Hispanic 0.138 0.356 16 039 704 
Ethnicity is Black -0.458 0.431 16 -1.06 304 
Ethnicity is Unknown 0.270 0.434 16 0.62 42 
Ethnicity is Mixed 0.856 0.462 16 1.85 083 
Ethnicity is Native American (reference category) 0.000 
Missing ethnicity 0.492 0.675 16 0.73 476 
Teacher gender is female -0.154 0.189 16 -0.82 426 
Teacher gender is male (reference 0.000 
category) 
Missing gender -2.242 1.064 16 -2.11 051 
Certificate in Early Childhood Ed. 0.266 0.297 16 0.90 384 
Certified in Eng. Language Dev. -0.117 0.161 16 -0.73 478 


AN EMPIRICAL EDUCATION RESEARCH REPORT APPENDIX 46 


EFFECTIVENESS OF WESTED'S MAKING SENSE OF SCIENCE 


TABLE 13. ESTIMATES OF FIXED EFFECTS — RETAINED IN STUDY SAMPLE 


Fixed effects Coefficient Standard error df t-ratio p value 
Has Higher Ed. Degree -0.293 0.110 16 -2.66 017 
Years of Teaching Experience 0.004 0.008 16 0.44 .665 
Missing Years Teaching 0.390 0.253 16 1.54 143 
Taught science previous year 77 0.305 16 -0.58 5/0 
Indicates no time science instruction -0.283 0.192 16 -1.48 159 
frecest scale: School context teacher / 0.113 0.174 16 0.65 526 
admin culture 
Pretest scale: School context teacher 0.309 0.166 16 -1.86 082 
culture 
Pretest scale: Use of NGSS activities -0.076 0.046 16 -1.66 117 
Pretest scale: confidence with science 0.029 0.101 16 -0.28 780 
content 
fie testceole: confidence in science 0.162 0.178 16 0.91 377 
instruction 
iietest scale: confidence with literacy 0.011 0.148 16 0.08 941 
and discourse 
eee scale: Perceived level of 0.111 0.100 16 1.11 284 
influence 
Pretest scale: Beliefs about teaching Ap 239 0.118 16 -2.02 .060 
Has mentor available at school 0.189 0.282 16 0.67 513 


Note. We do not include estimates for pair fixed effects in the table. 


TABLE I4. ESTIMATES OF LEVEL-1 AND LEVEL-2 VARIANCE COMPONENTS (RANDOM EFFECTS) -
RETAINED IN STUDY SAMPLE


Random effect        Variance component   Standard error   Z value   p value
School               0.211                0.188            1.12      .131
Residual (teacher)   0.305                0.113            3.15      <.001
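
The Z values reported for the variance components appear to be Wald statistics, that is, each variance component divided by its standard error. The following is a minimal sketch of that calculation, not the software routine used in the study. The one-sided normal p value is an assumption, adopted because it reproduces the School row of Table I4, and the helper name `wald_z_test` is hypothetical.

```python
from statistics import NormalDist


def wald_z_test(variance_component, standard_error):
    """Return (z, one-sided p) for H0: variance component = 0.

    Assumption: the p value is one-sided against the standard normal,
    because a variance component cannot be negative.
    """
    z = variance_component / standard_error
    p = 1.0 - NormalDist().cdf(z)
    return z, p


# School row of Table I4: variance component 0.211, standard error 0.188.
z, p = wald_z_test(0.211, 0.188)
print(f"Z = {z:.2f}, p = {p:.3f}")
```

Because the tabled inputs are rounded to three decimals, recomputed values can differ from the reported ones in the last digit.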



Appendix J. Sensitivity Analysis for the Analysis of Impact on Intermediate Outcomes 


This appendix presents the sensitivity analyses conducted to assess the robustness of the results of the analysis of impacts on intermediate
outcomes.


Because our a priori selected benchmark model (model 1 in Table J1) yields an estimate of zero for the school-level random effect, as part
of the sensitivity analysis we remove pair effects altogether to free up variance and allow the school variance component to be estimated
(model 2). A result of zero variance with no p value means the estimation procedure has reached a boundary condition for estimating
the corresponding effect (Singer & Willett, 2003), often implying that the variance component is trivially different from zero. Nevertheless,
we prioritized including the school level in the analysis because schools are the unit of random assignment. As expected, with this change
the school variance component becomes estimable in most cases. By excluding block effects, however, our impact estimates become less
precise, and several of the results no longer reach statistical significance (comparing model 1 to model 2). The magnitudes of the impact
estimates do not fluctuate much between model 1 and model 2, indicating that reaching the boundary condition in estimating the school
effect in model 1 is not inducing any major bias in the impact estimates. To further assess the sensitivity of the benchmark result, we
evaluate impact using the benchmark model specification but with restricted maximum likelihood estimation instead of full maximum
likelihood (model 3). Many school-level variance components become estimable, but there is an accompanying loss of precision. Models 4
and 5 also use restricted maximum likelihood, with pair effects modeled as random (model 4) or omitted (model 5). Several results that
were statistically significant under model 1 cease to be so under models 4 and 5. In total, we show results from five approaches to
modeling impact.
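
To illustrate why switching from full to restricted maximum likelihood can move a variance component off the zero boundary, consider a balanced one-way random-effects layout. This is a much simpler setting than the models in Table J1, and the scores below are made up for illustration. In this layout the closed-form estimators differ only in the divisor applied to the between-group sum of squares (the number of groups under full maximum likelihood, one less under restricted maximum likelihood), so the full maximum likelihood estimate is the smaller of the two and reaches zero first.

```python
def between_group_variance(groups):
    """ML and REML estimates of the between-group variance component for a
    balanced one-way random-effects layout, truncated at the zero boundary."""
    a = len(groups)        # number of groups (for example, schools)
    n = len(groups[0])     # observations per group (balanced design assumed)
    grand_mean = sum(sum(g) for g in groups) / (a * n)
    means = [sum(g) / n for g in groups]
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ms_within = ss_within / (a * (n - 1))
    # REML divides the between-group sum of squares by a - 1; ML divides by a.
    reml = max(0.0, (ss_between / (a - 1) - ms_within) / n)
    ml = max(0.0, (ss_between / a - ms_within) / n)
    return ml, reml


# Hypothetical scores for four groups of two observations each.
hypothetical_scores = [[0.4, 1.6], [1.4, 2.6], [0.9, 2.1], [1.9, 3.1]]
ml, reml = between_group_variance(hypothetical_scores)
print(f"ML: {ml:.3f}, REML: {reml:.3f}")
```

With these made-up scores the full maximum likelihood estimate sits on the boundary while the restricted estimate is small but positive, which mirrors the pattern described for model 1 versus model 3.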


TABLE J1. MODELS FOR ASSESSING IMPACT ON INTERMEDIATE OUTCOMES 


           School effects   Pair effects   Estimation method
Model 1    Random           Fixed          Maximum likelihood
Model 2    Random           --             Maximum likelihood
Model 3    Random           Fixed          Restricted maximum likelihood
Model 4    Random           Random         Restricted maximum likelihood
Model 5    Random           --             Restricted maximum likelihood



TABLE J2. RESULTS FOR MODELING IMPACT ON INTERMEDIATE OUTCOMES 


Point 
est. 


-0.085 


0.214 


0.022 


0.208 


0.204 


0.187 


0.110 


0.392 


1.746 


0.518 


O.593 


Model 1 


p
value 


meh, 


215 


dia 


es 


074 


.206 


285 
.020 


016 


019 


.003 


School
var. 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


Point 
est. 


-0.059 


239 


0.026 


O.198 


0.249 


0.312 


0.091 


0.410 


1.075 


O23/ 


0.342 


Model 2


p
value


673 


236 


152 


12 


062 


ess 


A65 
014 


N71 


362 


134 


School
var. 


0.000 


0.093 


0.002 


0,029 


0.035 


0.000 


0.046 


0.038 


0.770 


0.187 


0.066 


Point 
est. 


-0.085 


0.214 


0.010 


0.205 


0.204 


0.187 


0.110 


(359 


ie 


0.518 


0.593 


Model 3


p 
value 


ro Re) 


349 


G27 


202 


164 


320 


A05 


10g 


087 


067 


022 


School
Var. 


0.000 


0.000 


0.029 


0.040 


0.000 


0.000 


0.000 


0.0973 


1136 


0.000 


0.000 


Point
est. 


-0.063 


0.244 


0.025 


0.201 


0.237 


0.312 


0.088 


0.407 


1.045 


0.300 


0.392 


Model 4 

p School
value Var. 
679 0.000 
289 0.134 
189 O01] 
238 0.057 
.074 0.000 
059 0.000 
A84 0.009 
033 0.067 
246 1.729 
249 0.003 
095 0.000 


Pair 


0.004 


0.089 


0.000 


0.064 


0.012 


0.000 


0.243 


0.199 


Point 
est. 


0.201 


O75 


G32 


0.091 


0.410 


1.045 


i238 


0.237 


Model 5 

p School
value var. 
706 0.000 
287 0.151 
785 0.016 
237 0.061 
.100 0.060 
059 0.000 
521 0.069 
031 0.077 
246 1.729 
A428 0.276 
196 0.135 


TABLE J2. RESULTS FOR MODELING IMPACT ON INTERMEDIATE OUTCOMES 


Point 
est. 


07503 


-0.351 


0.464 


-0.133 


0.602 


OF07 


O.573 


OAS 


0.048 


0.070 


0.328 


Model 1 


p
value 


als 


471 


A52 


674 


349 


9S 


.000 


a5 


084 


644 


OS7. 


School 
var. 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


0.000 


Point 
est. 


0.118 


-0.944 


-0.205 


-0.455 


0.670 


0.655 


0.594 


0.049 


-0.103 


0.044 


0.281 


Model 2


p
value


B11 


090 


780 


252 


00) 


.209 


000 


748 


00 


Mei) 


099 


School
var. 


0.000 


0.981 


2612 


0.682 


1.097 


0.000 


0.019 


0.086 


0.042 


0.002 


0.000 


Point 
est. 


0.497 


-0.452 


0.413 


-0.182 


0.602 


0.907 


O.072 


OT7S 


0.037 


0.046 


0.328 


Model 3


p 
value 


447 


024 


674 


/16 


465 


10 


.001 


279 


ade 


828 


0S 


School
var. 


0.144 


4.135 


1.072 


0.000 


0.000 


0.049 


0.003 


U.037 


0.084 


0.000 


Point
est. 


0.136 


-0.950 


-0.204 


-0.454 


0.634 


0.855 


0.590 


0.067 


-0.090 


0,037 


0.265 


Model 4 

p School

value var. 
809 = =0.154 
ola ose 
804 = 3.301 
.286 0.866 
392 0.080 
.261 ~~ 0.000 
000 80.037 
667 =—0.041 
443, 0.042 
8a5- (000) 
155. 0,000 


Pair 


0.000 


0.000 


0.000 


1.629 


0.000 


0.000 


0.079 


0.020 


0.047 


0.093 


Point 
est. 


0.136 


-0.950 


-0.204 


-0.454 


0.666 


(L655 


0.590 


0.047 


-0.104 


032 


0.278 


Model 5 


p 
value 


809 


alls 


804 


286 


A413 


261 


.000 


784 


394 


865 


150 


School 
var. 


O35 


1.372 


3.501 


0.866 


1729 


0.000 


0.037 


OTs 


0.057 


0.051 


0.008 



TABLE J2. RESULTS FOR MODELING IMPACT ON INTERMEDIATE OUTCOMES 
Model 1 Model 2 Model 3 Model 4 Model 5


p School Point p School Point p School Point p School Pair Point p School


value var. est. value var. est. value var. est. value var. var. est. value var.


0.056 


800 


0.000 0.048 


J27 0.106 


0.128 .383 0.000 0.054 742 0.0017 0.128 494 0.000 


0.075 


384 


O.1.38 


0.063 


0.019 


sei 


211 0.000 


O.190 


0.050 


Ola? al? 


111 0.000 


0.190 


Note. p values < .05 are highlighted in red. Est. = Estimate; Var. = Variance; NGSS = Next Generation Science Standards; PS = Physical Science; DCI = Disciplinary Core Ideas; ES
= Earth and Space Science; SEP = Science and Engineering Practices; CCC = Crosscutting Concepts.



Appendix K. Hierarchical Linear Model Associated with the Confirmatory Impacts on 
Student Science Achievement (Selected-Response Items) 


This appendix presents the hierarchical linear model used to evaluate impact on student science 
achievement (Equation K1). 


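$$
Y_{ijk} = \beta_0 + \beta_1 \mathrm{Treatment}_k + \sum_{q=1}^{Q} \lambda_q X_{qijk} + \sum_{r=1}^{R} \gamma_r Z_{rjk} + \sum_{s=1}^{S} \alpha_s Z_{sk} + \sum_{t=1}^{T} \delta_t \mathrm{BLOCK}_t + \zeta_k + \varepsilon_{ijk} \tag{K1}
$$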


Y_ijk is the outcome for student i belonging to the class of teacher j (in the 2017/18 school year) in
school k. Treatment_k is a binary variable at the school level, with 0 indicating assignment to control
and 1 indicating assignment to Making Sense of SCIENCE. The effect of the intervention is assessed in
terms of the statistical significance of the estimate of β1. The model includes effects of covariates at the
student level (λ_q), at the teacher level (γ_r), and at the school level (α_s), as well as fixed effects for
randomized blocks (we assume T blocks, with BLOCK_t taking on the value 1 if school k is in block t
and 0 otherwise). ζ_k and ε_ijk represent school- and student-level random effects, respectively.
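
The structure of Equation K1 can be sketched as a data-generating process. The sketch below omits the covariate terms and assumes blocks of two schools with one school per block assigned to treatment. Every numeric value, and the function name `simulate_k1`, is hypothetical and chosen only for illustration. It is not the estimation procedure used in the study.

```python
import random


def simulate_k1(n_blocks=3, students_per_school=4, beta0=0.0, beta1=0.1,
                sd_school=0.2, sd_student=0.7, seed=0):
    """Simulate outcomes with a treatment effect, block effects, a school-level
    random effect, and a student-level residual (covariates omitted)."""
    rng = random.Random(seed)
    records = []
    for t in range(n_blocks):
        block_effect = rng.gauss(0.0, 0.3)      # hypothetical effect of block t
        arms = [0, 1]
        rng.shuffle(arms)                        # randomize treatment within the block
        for position, treatment in enumerate(arms):
            school = 2 * t + position
            school_effect = rng.gauss(0.0, sd_school)      # school-level random effect
            for student in range(students_per_school):
                residual = rng.gauss(0.0, sd_student)      # student-level random effect
                y = beta0 + beta1 * treatment + block_effect + school_effect + residual
                records.append({"block": t, "school": school, "student": student,
                                "treatment": treatment, "y": y})
    return records


records = simulate_k1()
print(len(records), "simulated student records")
```

In the study itself the block indicators enter as fixed effects and the school term as a random effect; the simulation only shows which terms vary at which level.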



Appendix L. Full Estimates of the Benchmark Impact Model for the Confirmatory 
Analysis of Impacts on Student Science Achievement (Selected-Response Items) 


This appendix provides the full estimates of the benchmark impact model for the confirmatory 
analysis of impact on student science achievement (full sample N = 2,140) as measured by selected- 
response items on the student science assessment. 


TABLE L1. ESTIMATES OF FIXED EFFECTS


Fixed effects                                                     Coefficient   Standard error   t-ratio       df     p value
Intercept                                                         -0.675        0.489            -1.379        23     .181
Treatment Status                                                   0.062        0.089             0.696        23     .494
School Size                                                        0.000        0.000             0.244        23     .809
School in City                                                     0.115        0.165             0.696        23     .493
Title 1 Status                                                    -0.035        0.230            -0.152        23     .880
State CA                                                           0.305        0.394             0.774        23     .447
ELA pretest                                                        0.318        0.028            11.207        2049   <.001
Math pretest                                                       0.266        0.028             9.460        2049   <.001
Grade 4                                                           -0.031        0.043            -0.731        2049   .465
Male                                                               0.064        0.033            [illegible]   2049   .056
ELL                                                               -0.130        0.046            -2.810        2049   .005
Asian                                                             -0.012        0.067            -0.183        2049   .855
Black                                                             -0.160        0.066            -2.412        2049   .016
Hispanic                                                          -0.097        0.053            -1.837        2049   .066
Native Indian                                                     -0.135        0.183            -0.739        2049   .460
Gender, ELL & Ethnicity Missing                                   -0.229        0.166            -1.379        2049   .168
Ethnicity Unspecified                                              0.015        0.068             0.222        2049   .824
FRPL Eligible                                                      0.076        0.050            [illegible]   2049   .129
FRPL missing                                                       0.185        0.101             1.836        2049   .066
With White teacher                                                -0.018        0.055            -0.334        2049   [illegible]
Teacher ethnicity missing                                          0.353        0.158            [illegible]   2049   .025
Certificate in elementary education                                0.248        0.095             2.603        2049   .009
Certificate in English Language [illegible]                        0.097        0.058             1.673        2049   .094
Highest level of education                                         0.020        0.036             0.562        2049   .574
Missing Highest level of education                                -0.218        0.235            -0.927        2049   .354
Years of classroom teaching                                        0.001        0.003             0.407        2049   .684
Missing in years of classroom teaching                             0.168        0.112             1.502        2049   .133
Missing Teacher pretest                                            0.442        0.149             2.965        2049   .003
Teacher content pretest                                            0.749        0.217             3.450        2049   <.001
Missing Survey                                                    -0.522        0.250            -2.090        2049   .037
NGSS Missing                                                      -0.010        0.103            -0.096        2049   .924
Teacher's gender                                                  -0.285        0.067            -4.256        2049   <.001
Taught Science in previous year                                    0.058        0.079            [illegible]   2049   .467
[illegible] instruction                                            0.074        0.058             1.282        2049   .200
Composite of school context culture between admins and teachers    0.091        0.061             1.496        2049   .135
Composite of school context culture among teachers                 0.043        0.058            [illegible]   2049   .457
NGSS-related activities participated in                            0.034        0.016             2.185        2049   .029
Composite of confidence on specific science content               -0.063        0.037            -1.700        2049   .089
Composite of confidence in literacy and discourse                  0.024        0.039             0.612        2049   [illegible]
Composite of perceived level of influence                          0.024        0.037             0.639        2049   .523
Composite of teaching philosophies                                -0.080        0.049            -1.637        2049   .102
Having coaches or mentors for science instruction                  0.056        0.046             0.844        2049   .399


Note. We do not include estimates for pair fixed effects in the table. 


CA = California; ELA = English Language Arts; ELL = English Language Learner; FRPL = Free or Reduced Price Lunch; NGSS = Next 
Generation Science Standards 


TABLE L2. ESTIMATES OF LEVEL-1 AND LEVEL-2 VARIANCE COMPONENTS (RANDOM EFFECTS)


Random effect   Standard deviation   Variance component   df   Chi-squared   p value
School          0.236                0.056                23   106.194       <.001
Student         0.739                0.546


Note. The analysis was conducted using HLM software, which does not provide df, test statistic, and p value at level-1. 



Appendix M. Sensitivity Analyses for the Confirmatory Impacts on Student Science 
Achievement (Selected-Response Items) 


This appendix presents the results of the sensitivity analyses for the confirmatory impacts on student 
science achievement as measured by the selected-response items on the science assessment. The 
sensitivity analyses include scores derived from different score calibration approaches (percent- 
correct, 1PL, 2PL, and 3PL) and different model specifications. Results in Tables M1 and M2 are based 
on the score calibrations that included all items. Results in Table M3 are based on “reduced item”
forms, from which we excluded items with factor loadings less than .20 on the principal dimension.
For Table M3, scores are calculated as percent-correct.
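
For readers unfamiliar with the calibration approaches named above, the sketch below shows the standard three-parameter logistic (3PL) item response function and how the 2PL and 1PL models arise as special cases, alongside a percent-correct score. The exact parameterization used in the study (for example, any scaling constant) is not stated in this appendix, so this is a textbook form under that assumption, and the function names are hypothetical.

```python
import math


def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: student ability; a: item discrimination; b: item difficulty;
    c: lower asymptote (guessing). Setting c = 0 gives the 2PL model;
    setting c = 0 with a common discrimination for all items gives the 1PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def percent_correct(responses):
    """Percent-correct score from a list of 0/1 item responses."""
    return 100.0 * sum(responses) / len(responses)


# Hypothetical item: average difficulty, four-option guessing floor of 0.25.
print(p_correct(theta=0.0, a=1.2, b=0.0, c=0.25))
print(percent_correct([1, 0, 1, 1]))
```

Percent-correct weights every item equally, whereas the logistic models weight items through their estimated parameters, which is why the appendix reports results under each approach.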


TABLE M1. IMPACT FINDINGS BASED ON 3PL SCALING WITH ALTERNATIVE MODELS
N (SCHOOLS) = 55, N (STUDENTS) = 2,140


                                                                    Impact (SE)      t       df       p value       Effect size   Change in percentile rank
Benchmark analysis, full covariate set (a)                          0.062 (.089)     .696    23       .494          .064          2.5%
Sensitivity Analysis (Alternative Models)
Like benchmark, no covariates, no blocks                            -0.044 (.102)    -.432   52       .667          -.045         -1.8%
Like benchmark, no covariates                                       0.004 (.089)     0.041   26       .967          .004          0.1%
Like benchmark, pretests are only covariates                        0.050 (.067)     .751    26       .459          .052          2.1%
Include teacher random effect                                       0.065 (0.091)    0.72    23       .478          0.067         2.7%
Use pair random (instead of fixed) effects                          0.022 (0.075)    0.29    23       [illegible]   0.023         0.9%
Ignore pair level                                                   0.022 (0.075)    0.29    49       [illegible]   0.023         0.9%
Ignore school, model random intercept and treatment at pair level   0.024 (0.075)    0.32    22       .756          0.025         1.0%
Use ML instead of REML                                              .053 (.054)      .980    23       .339          0.055         2.2%
OLS                                                                 .051 (.048)      1.06    2072     .288          0.053         2.1%
Multiple imputation to address missing data (b)                     .069 (.080)      .86     519.35   .390          .071          2.9%


Note. Most covariates are modeled at the student level. 
OLS is Ordinary Least Squares. ML is Full Maximum Likelihood. REML is Restricted Maximum Likelihood. 
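
The "change in percentile rank" column appears consistent with converting the standardized effect size into the expected percentile gain of a student who starts at the 50th percentile, assuming normally distributed scores. That conversion is not stated in this appendix, so the sketch below is an assumption that happens to reproduce several tabled values; the function name is hypothetical.

```python
from statistics import NormalDist


def change_in_percentile_rank(effect_size):
    """Percentile-point change for a student at the 50th percentile,
    assuming a normal outcome distribution."""
    return 100.0 * (NormalDist().cdf(effect_size) - 0.5)


# Effect sizes taken from Table M1.
for d in (-0.045, 0.052, 0.067):
    print(f"{d:+.3f} -> {change_in_percentile_rank(d):+.1f}%")
```

Because the tabled effect sizes are themselves rounded, a recomputed percentile change can differ from the reported one by a tenth of a point.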


[One sentence of this note is illegible in the source.]


(a) Full results of benchmark model are provided in Appendix L.


(b) The sample consisted of 2,544 students. This included the 2,140 students used in the benchmark analysis plus students with spring
2018 posttests who had been excluded because they were missing one of the two pretests. Therefore, all students have posttests
and some may be missing one or both pretests. Imputation is of all missing covariates, including the pretests. Students without
posttests are listwise deleted. The imputation regression model included an indicator variable for intervention status, included all
covariates that were used for statistical adjustment in the impact estimation model, and included the outcome when imputing
missing baseline data. Results were based on 10 rounds of imputation. Each analysis adjusted for the nesting of individual outcomes
in schools. Analysis was conducted using PROC MI and PROC MIANALYZE in SAS. PROC MIXED was used for each imputation cycle.
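
Results from multiply imputed data sets are conventionally combined with Rubin's rules, which also yield a fractional degrees-of-freedom value like the one reported for the multiple imputation row of Table M1. The sketch below shows that pooling step under the assumption that the conventional rules were applied. The ten estimates and standard errors are hypothetical, not the study's values, and `pool_rubin` is a hypothetical helper name.

```python
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool per-imputation estimates with Rubin's rules.

    Returns (pooled estimate, pooled standard error, degrees of freedom).
    Requires at least two imputations whose estimates are not all identical.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                        # variance across imputations
    total = within + (1 + 1 / m) * between
    df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return pooled, total ** 0.5, df


# Hypothetical impact estimates from 10 imputed data sets.
hypothetical_estimates = [0.066, 0.071, 0.068, 0.073, 0.065,
                          0.070, 0.069, 0.072, 0.067, 0.069]
hypothetical_ses = [0.080] * 10
estimate, se, df = pool_rubin(hypothetical_estimates, hypothetical_ses)
print(f"impact = {estimate:.3f}, SE = {se:.3f}, df = {df:.1f}")
```

When the estimates vary little across imputations relative to their sampling variance, the pooled degrees of freedom become large, which is the expected behavior of this formula.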



TABLE M2. IMPACT FINDINGS BASED ON PERCENT-CORRECT, 1PL, 2PL AND 3PL SCALING WITH
ALTERNATIVE MODELS
N (SCHOOLS) = 55, N (STUDENTS) = 2,140


                                                 Impact (SE)     t             df   p value       Effect size   Change in percentile rank
Random Pair, No covariates, %correct             -.020 (.084)    -0.237        26   .814          -0.021        -0.8%
Random Pair, No covariates, 1 PL                 -.019 (.084)    -.222         26   .826          -0.020        -0.8%
Random Pair, No covariates, 2 PL                 -.019 (.087)    -.217         26   .830          -0.020        -0.8%
Random Pair, No covariates, 3 PL                 -.027 (.086)    -.310         26   [illegible]   -0.028        -1.1%
Random Pair, pretest only covariate, %correct    .038 (.057)     .668          26   .510          0.039         1.6%
Random Pair, pretest only covariate, 1 PL        .039 (.057)     .688          26   .497          0.040         1.6%
Random Pair, pretest only covariate, 2 PL        .043 (.059)     .739          26   .467          0.044         1.8%
Random Pair, pretest only covariate, 3 PL        .036 (.060)     .600          26   .554          0.037         1.5%
Random Pair, all covariates, %correct            .015 (.061)     .250          23   .804          0.015         0.6%
Random Pair, all covariates, 1 PL                .020 (.060)     [illegible]   23   .741          0.021         0.8%
Random Pair, all covariates, 2 PL                .034 (.066)     .510          23   .615          0.035         1.4%
Random Pair, all covariates, 3 PL                .026 (.068)     .386          23   .703          0.027         1.1%
Fixed Pair, No covariates, %correct              .010 (.084)     .118          26   .907          0.010         0.4%
Fixed Pair, No covariates, 1 PL                  .012 (.084)     .138          26   .891          0.012         0.5%
Fixed Pair, No covariates, 2 PL                  .011 (.088)     [illegible]   26   .900          0.011         0.4%
Fixed Pair, No covariates, 3 PL                  .004 (.089)     .041          26   .967          0.004         0.2%
Fixed Pair, pretest only covariate, %correct     .052 (.061)     .849          26   .404          0.054         2.2%
Fixed Pair, pretest only covariate, 1 PL         .053 (.061)     .878          26   .388          0.055         2.2%
Fixed Pair, pretest only covariate, 2 PL         .057 (.065)     .886          26   .384          0.059         2.4%
Fixed Pair, pretest only covariate, 3 PL         .050 (.067)     .751          26   .459          0.052         2.1%
Fixed Pair, all covariates, %correct             .065 (.082)     .790          23   .438          0.067         2.7%
Fixed Pair, all covariates, 1 PL                 .067 (.082)     .824          23   .419          0.069         2.8%
Fixed Pair, all covariates, 2 PL                 .077 (.087)     .883          23   .386          0.079         3.1%
Fixed Pair, all covariates, 3 PL                 .062 (.089)     .696          23   .494          0.064         2.5%


Note. Most covariates are modeled at the student-level. 


1/2/3 PL = 1/2/3-Parameter Logistic 



TABLE M3. IMPACT FINDINGS BASED ON THE PERCENT-CORRECT METRIC WITH REDUCED-ITEMS
FORMS
N (SCHOOLS) = 55, N (STUDENTS) = 2,140


                                                                    Impact (SE)      t       df            p value   Effect size   Change in percentile rank
Benchmark analysis, full covariate set (a)                          0.078 (0.091)    0.86    23            0.398     0.080         3.2%
Sensitivity Analysis (Alternative Models)
Like benchmark, no covariates, no blocks                            -0.033 (0.100)   -0.33   [illegible]   0.743     -0.034        -1.4%
Like benchmark, no covariates                                       0.011 (0.086)    0.12    26            0.903     0.011         0.5%
Like benchmark, pretests are only covariates                        0.056 (0.064)    0.88    26            0.388     0.058         2.3%
Include teacher random effect                                       0.078 (0.091)    0.85    23            0.402     0.080         3.2%
Use pair random (instead of fixed) effects                          0.035 (0.072)    0.49    20            0.632     0.036         1.4%
Ignore pair level                                                   0.035 (0.072)    0.49    49            0.629     0.036         1.4%
Ignore school, model random intercept and treatment at pair level   0.039 (0.072)    0.52    [illegible]   0.600     0.040         1.6%
Use ML instead of REML                                              0.065 (0.053)    1.22    [illegible]   0.234     0.067         2.7%
OLS                                                                 0.062 (0.049)    1.29


Note. Most covariates are modeled at the student level.


[One sentence of this note is illegible in the source.]


OLS is Ordinary Least Squares. ML is Maximum Likelihood. REML is Restricted Maximum Likelihood.


(a) Full results of benchmark model are provided in Appendix L.



Appendix N. Full Estimates of the Benchmark Impact Model for the Confirmatory 
Analysis of Impacts on Science Achievement of Students in the Lowest Third of 
Incoming Achievement 


TABLE N1. ESTIMATES OF THE BENCHMARK IMPACT MODEL FOR STUDENTS IN THE LOWEST THIRD
OF INCOMING ELA ACHIEVEMENT (N = 715)


Fixed effects                                                     Coefficient   Standard error   t-ratio       df    p value
Intercept                                                         -1.773        0.708            -2.504        23    .020
Treatment Status                                                   0.054        0.093             0.581        23    .567
School Size                                                        0.000        0.000             0.103        23    .919
School in City                                                     0.138        0.174             0.793        23    .436
Title 1 Status                                                    -0.038        0.249            -0.153        23    .879
State CA                                                           0.728        0.447             1.627        23    [illegible]
ELA pretest                                                        0.022        0.067             0.323        624   .747
Math pretest                                                       0.248        0.040             6.143        624   <.001
Grade 4                                                           -0.026        0.077            -0.342        624   [illegible]
Male                                                              -0.030        0.055            -0.554        624   .580
ELL                                                               -0.046        0.072            -0.641        624   [illegible]
Asian                                                             -0.036        0.131            -0.277        624   .782
Black                                                             -0.070        0.111            -0.635        624   [illegible]
Hispanic                                                          -0.070        0.102            -0.687        624   .492
Native Indian                                                      0.082        0.331             0.249        624   .803
Gender, ELL & Ethnicity Missing                                   -0.335        0.269            -1.246        624   .213
Ethnicity Unspecified                                              0.171        0.132             1.298        624   .195
FRPL Eligible                                                     -0.035        0.092            -0.377        624   .706
FRPL missing                                                       0.214        0.202             1.058        624   .290
With White teacher                                                -0.116        0.084            -1.391        624   .165
Teacher ethnicity missing                                          0.151        0.245             0.617        624   [illegible]
Certificate in elementary education                                0.131        0.181             0.723        624   .470
Certificate in English Language [illegible]                        0.087        0.091             0.954        624   .341
Highest level of education                                        -0.029        0.063            -0.460        624   .646
Missing Highest level of education                                -0.241        0.385            -0.625        624   [illegible]
Years of classroom teaching                                        0.008        0.005             1.664        624   .097
Missing in years of classroom teaching                            -0.080        0.162            -0.494        624   .621
Missing Teacher pretest                                            0.637        0.232             2.747        624   .006
Teacher content pretest                                           [illegible]   [illegible]       3.438        624   <.001
Missing Survey                                                     0.203        0.380            [illegible]   624   [illegible]
NGSS Missing                                                       0.120        0.143             0.836        624   .404
Teacher's gender                                                  -0.244        0.100            -2.440        624   .015
Taught Science in previous year                                    0.000        0.122            -0.001        624   .999
[illegible] instruction                                            0.013        0.093             0.135        624   .893
Composite of school context culture between admins and teachers    0.027        0.095             0.287        624   .774
Composite of school context culture among teachers                 0.016        0.088             0.176        624   .861
NGSS-related activities participated in                            0.044        0.026             1.709        624   .088
Composite of confidence on specific science content                0.077        0.060             1.275        624   .203
Composite of confidence in literacy and discourse                  0.006        0.067            [illegible]   624   .924
Composite of perceived level of influence                          0.031        0.058             0.541        624   .589
Composite of teaching philosophies                                 0.067        0.069             0.959        624   .338
Having coaches or mentors for science instruction                  0.086        0.112             0.772        624   .440


Note. We do not include estimates for pair fixed effects in the table. 


CA = California; ELA = English Language Arts; ELL = English Language Learner; FRPL = Free or Reduced Price Lunch; NGSS = Next 
Generation Science Standards 


TABLE N2. ESTIMATES OF LEVEL-1 AND LEVEL-2 VARIANCE COMPONENTS (RANDOM EFFECTS)


Random effect   Standard deviation   Variance component   df   Chi-squared   p value
School          0.150                0.022                23   26.922        [illegible]
Student         0.689                0.475


Note. The analysis was conducted using HLM software, which does not provide df, test statistic, and p value at level-1. 



TABLE N3. ESTIMATES OF THE BENCHMARK IMPACT MODELS FOR STUDENTS IN THE LOWEST THIRD
OF INCOMING MATH ACHIEVEMENT (N = 713)


Fixed effects                                                     Coefficient   Standard error   t-ratio   df    p value
Intercept                                                         -0.923        0.705            -1.309    23    0.203
Treatment Status                                                   0.162        0.094             1.718    23    0.099
School Size                                                        0.000        0.000            -1.07     23    0.296
School in City                                                     0.101        0.173             0.585    23    0.564
Title 1 Status                                                    -0.655        0.264            -2.479    23    0.021
State CA                                                           0.379        0.454             0.834    23    0.413
ELA pretest                                                       [illegible]   0.049             4.331    622   <0.001
Math pretest                                                      [illegible]   0.056             2.348    622   0.019
Grade 4                                                            0.034        0.077             0.446    622   0.655
Male                                                               0.019        0.055             0.345    622   0.730
ELL                                                               -0.039        0.071            -0.541    622   0.589
Asian                                                              0.125        0.135             0.927    622   0.354
Black                                                             -0.082        0.111            -0.738    622   0.461
Hispanic                                                          -0.063        0.101            -0.625    622   0.532
Native Indian                                                      0.158        0.368             0.43     622   0.668
Gender, ELL & Ethnicity Missing                                   -0.253        0.279            -0.907    622   0.365
Ethnicity Unspecified                                              0.010        0.136             0.074    622   0.941
FRPL Eligible                                                      0.102        0.094             1.09     622   0.276
FRPL missing                                                       0.278        0.197             1.414    622   0.158
With White teacher                                                -0.030        0.082            -0.365    622   0.715
Teacher ethnicity missing                                          0.148        0.227             0.65     622   0.516
Certificate in elementary education                                0.280        0.164             1.707    622   0.088
Certificate in English Language [illegible]                        0.081        0.092             0.873    622   0.383
Highest level of education                                        -0.042        0.065            -0.65     622   0.516
Missing Highest level of education                                -0.049        0.372            -0.132    622   0.895
Years of classroom teaching                                        0.004        0.004             0.929    622   [illegible]
Missing in years of classroom teaching                             0.021        0.183             0.113    622   0.910
Missing Teacher pretest                                            0.455        0.240             1.895    622   0.059
Teacher content pretest                                            0.773        0.346             2.232    622   0.026
Missing Survey                                                     0.095        0.401             0.236    622   0.813
NGSS Missing                                                       0.132        0.149             0.882    622   0.378
Teacher's gender                                                  -0.306        0.103            -2.969    622   0.003
Taught Science in previous year                                    0.055        0.125             0.442    622   0.659
[illegible] instruction                                            0.046        0.096             0.481    622   0.631
Composite of school context culture between admins and teachers    0.165        0.100             1.655    622   0.098
Composite of school context culture among teachers                -0.100        0.075            -1.052    622   0.293
NGSS-related activities participated in                            0.028        0.025             1.144    622   0.253
Composite of confidence on specific science content               -0.052        0.062            -0.848    622   [illegible]
Composite of confidence in literacy and discourse                  0.028        0.068             0.413    622   0.680
Composite of perceived level of influence                          0.013        0.058             0.229    622   0.819
Composite of teaching philosophies                                -0.003        0.073            -0.037    622   0.971
Having coaches or mentors for science instruction                  0.025        0.111             0.224    622   0.823


Note. We do not include estimates for pair fixed effects in the table. 


CA = California; ELA = English Language Arts; ELL = English Language Learner; FRPL = Free or Reduced Price Lunch; NGSS = Next 
Generation Science Standards 


TABLE N4. ESTIMATES OF LEVEL-1 AND LEVEL-2 VARIANCE COMPONENTS (RANDOM EFFECTS)


Random effect   Standard deviation   Variance component   df   Chi-squared   p value
School          0.165                0.027                23   [illegible]   [illegible]
Student         0.689                0.475


Note. The analysis was conducted using HLM software, which does not provide df, test statistic, and p value at level-1. 



Appendix O. Sensitivity Analyses for the Confirmatory Impacts on Science 
Achievement of Students in the Lowest Third of Incoming Achievement 

This appendix presents the sensitivity analyses for confirmatory impacts on student science 
achievement for students in the lowest third of incoming ELA and math achievement. Student 
science achievement is measured using selected-response items on the science assessment. The 
sensitivity analyses include scores derived from different score calibration approaches (percent- 
correct, 1PL, 2PL, and 3PL) and different model specifications. 


ELA PRETEST 


TABLE O1. IMPACT FINDINGS BASED ON 3PL SCALING WITH ALTERNATIVE MODELS
N (SCHOOLS) = 55, N (STUDENTS) = 715


                                                                    Impact (SE)      t             df            p value   Effect size   Change in percentile rank
Benchmark analysis, full covariate set (a)                          .054 (.093)      0.581         23            .567      .073          2.9%
Sensitivity Analysis (Alternative Models)
Like benchmark, no covariates, no blocks                            .072 (.068)      1.061         [illegible]   .293      .098          3.9%
Like benchmark, no covariates                                       .094 (.069)      [illegible]   26            .180      .128          5.1%
Like benchmark, pretests are only covariates                        .083 (.066)      1.254         26            .221      [illegible]   4.5%
Include teacher random effect                                       0.054 (0.093)    0.58          24            .565      0.073         2.9%
Use pair random (instead of fixed) effects                          0.022 (0.072)    0.30          24            .766      0.030         1.2%
Ignore pair level                                                   0.022 (0.072)    0.30          50            .764      0.030         1.2%
Ignore school, model random intercept and treatment at pair level   0.022 (0.072)    0.31          22            .760      0.030         1.2%
Use ML instead of REML                                              0.047 (0.074)    0.64          24            .530      0.064         2.5%
OLS                                                                 0.047 (0.078)    0.61          647           .545      0.064         2.5%


Note. Most covariates are modeled at the student level. Mean performance in the treatment group was -0.607 units (.747 sd) in the
[remainder of sentence illegible].


OLS is Ordinary Least Squares. ML is Full Maximum Likelihood. REML is Restricted Maximum Likelihood.


(a) Full results of benchmark model are provided in Appendix N.



TABLE O2. IMPACT FINDINGS BASED ON PERCENT-CORRECT METRIC WITH REDUCED-ITEMS FORMS
N (SCHOOLS) = 55, N (STUDENTS) = 715


                                                                    Impact (SE)      t             df            p value   Effect size   Change in percentile rank
Benchmark analysis, full covariate set (a)                          0.041 (0.093)    0.44          24            .662      0.056         2.2%
Sensitivity Analysis (Alternative Models)
Like benchmark, no covariates, no blocks                            0.061 (0.067)    0.91          [illegible]   .369      0.083         3.3%
Like benchmark, no covariates                                       0.084 (0.063)    [illegible]   26            .196      [illegible]   4.6%
Like benchmark, pretests are only covariates                        0.070 (0.061)    [illegible]   26            .260      0.096         3.8%
Include teacher random effect                                       0.042 (0.094)    0.45          24            .659      0.057         2.3%
Use pair random (instead of fixed) effects                          0.030 (0.069)    0.45          23            .658      0.041         1.6%
Ignore pair level                                                   0.031 (0.069)    0.45          50            .655      0.042         1.7%
Ignore school, model random intercept and treatment at pair level   0.031 (0.069)    0.45          23            .655      0.042         1.7%
Use ML instead of REML                                              0.038 (0.073)    0.52          24            .608      0.052         2.1%
OLS                                                                 0.038 (0.077)    0.49          647           .621      0.052         2.1%


Note. Most covariates are modeled at the student level. Mean performance in the treatment group was -0.671 units (0.745 sd) in the
[remainder of sentence illegible].


OLS is Ordinary Least Squares. ML is Full Maximum Likelihood. REML is Restricted Maximum Likelihood.


(a) Full results of benchmark model are provided in Appendix N.


MATH PRETEST 


TABLE O3. IMPACT FINDINGS BASED ON 3PL SCALING WITH ALTERNATIVE MODELS
N (SCHOOLS) = 55, N (STUDENTS) = 713


                                                                    Impact (SE)      t             df            p value       Effect size   Change in percentile rank
Benchmark analysis, full covariate set (a)                          .162 (0.094)     1.718         23            .099          [illegible]   8.7%
Sensitivity Analysis (Alternative Models)
Like benchmark, no covariates, no blocks                            .060 (0.078)     0.769         53            .445          .081          3.2%
Like benchmark, no covariates                                       .092 (0.074)     1.246         26            .224          [illegible]   5.0%
Like benchmark, pretests are only covariates                        .091 (0.073)     [illegible]   26            .220          [illegible]   4.8%
Include teacher random effect                                       0.151 (0.096)    1.57          23            [illegible]   0.204         8.1%
Use pair random (instead of fixed) effects                          0.052 (0.071)    0.73          23            .473          0.070         2.8%
Ignore pair level                                                   0.052 (0.071)    0.73          49            .469          0.070         2.8%
Ignore school, model random intercept and treatment at pair level   0.053 (0.072)    0.74          [illegible]   .465          0.072         2.9%
Use ML instead of REML                                              0.147 (0.073)    2.01          23            .056          0.200         7.9%
OLS                                                                 0.147 (0.077)    1.92          645           .056          0.200         7.9%


Note. Most covariates are modeled at the student level. Mean performance in the treatment group was -0.600 units (.758 sd) in the
[remainder of sentence illegible].


OLS is Ordinary Least Squares. ML is Full Maximum Likelihood. REML is Restricted Maximum Likelihood.


(a) Full results of benchmark model are provided in Appendix N.


TABLE O4. IMPACT FINDINGS PERCENT CORRECT METRIC WITH REDUCED-ITEMS FORMS 
N (SCHOOLS) = 55, N (STUDENTS) = 713 


Effect Change 
Impact (SE) t df p value size percentile rank 

Benchmark analysis full covariate set 2 0.162 (0.102) 1.58 24 127 0.220 8.7% 
Sensitivity Analysis (Alternative Models) 

Like benchmark, no covariates, no blocks 0.070 (0.080) 0.88 5a 00 0.095 3.7% 
Like benchmark, no covariates 0.088 (0.080) 1.11 26 216 0.120 4.8% 
Like benchmark, pretests are only covariates 0.089 (0.078) 1.14 26 .265 0.121 4.8% 
Include teacher random effect 0.162 (0.102) 1.58 24 A27 0.220 8.7% 
Use pair random (instead of fixed) effects 0.074 (0.073) 1.01 23 323 0.101 4.0% 
Ignore pair level 0.074 (0.073) 1.01 49 318 0.101 4.0% 
Ignore school, model random intercept and treatment at pair level 0.076 (0.073) 1.04 23 309 0.104 4.1% 
Use ML instead of REML 0.146 (0.073) 2.00 24 569 0.200 7.9% 
OLS 0.146 (0.077) 1.90 645 058 0.200 7.9% 


Note. Most covariates are modeled at the student level. Mean performance in the treatment group was -0.671 units (0.737 sd) [remainder of sentence illegible in source]. 


OLS is Ordinary Least Squares. ML is Full Maximum Likelihood. REML is Restricted Maximum Likelihood. 


a Full results of benchmark model are provided in Appendix N. 
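
The "Change percentile rank" column in Tables O3 and O4 can be related to the "Effect size" column with a short calculation. The sketch below assumes normally distributed outcomes and a control student who starts at the 50th percentile; the function names are illustrative and are not taken from the report. Under that assumption, effect sizes of 0.200 and 0.220 correspond to changes of about 7.9 and 8.7 percentile points, which agrees with the tabled values, although the report does not state that this is the formula it used.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def percentile_rank_change(effect_size):
    """Percentile-point gain for a student starting at the 50th percentile.

    Assumes the outcome is normally distributed and the effect size is
    expressed in standard deviation units.
    """
    return 100.0 * (normal_cdf(effect_size) - 0.5)


# Effect sizes taken from the OLS and benchmark rows of Table O4.
for effect_size in (0.200, 0.220):
    change = percentile_rank_change(effect_size)
    print(f"effect size {effect_size:.3f}: {change:.1f} percentile points")
```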




Appendix P. Sample Sizes and Baseline Equivalence for the Impact on Student Science 
Achievement for Specific Subsamples 


This appendix presents the sample sizes and baseline equivalence for the impact on student science 
achievement (selected-response items) for specific subsamples (Focused Samples 1 and 2, by state, 
and by grade). 


FOCUSED SAMPLE 1 


The sample included 1,415 students (719 treatment, 696 control) who had both grade 3 state ELA and 
math pretests, with 814 students from California and 601 students from Wisconsin. Counts are shown 
in Table P1. 


TABLE P1. FOCUSED SAMPLE 1 


Count of schools in CA    Count of students with posttest in CA    Count of schools in WI    Count of students with posttest in WI 


Note. MSS stands for the group of students of teachers receiving the Making Sense of SCIENCE program. CA is California. WI is 
Wisconsin. Posttest is the student science achievement assessment. 


We tested baseline equivalence for (a) ELA pretest and (b) math pretest. Results are in Table P2. We 
observed that baseline equivalence is established for both ELA pretest and math pretest. 
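
This appendix does not restate the criterion used to judge baseline equivalence. As a point of reference, the sketch below applies a convention commonly used in education evaluations (the What Works Clearinghouse thresholds of 0.05 and 0.25 standard deviations) to a standardized difference in pretest means. The thresholds, the function names, and the scores in the usage lines are assumptions for illustration, not values from this study, and the small-sample correction used in Hedges' g is omitted for brevity.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_difference(treatment, control):
    """Difference in means divided by the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_variance = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_variance)


def equivalence_category(effect_size):
    """Classify a baseline difference using the assumed 0.05 / 0.25 thresholds."""
    size = abs(effect_size)
    if size <= 0.05:
        return "equivalent"
    if size <= 0.25:
        return "equivalent with statistical adjustment"
    return "not equivalent"


# Hypothetical pretest scores, invented for illustration only.
treatment_scores = [2.0, 3.0, 4.0]
control_scores = [1.0, 2.0, 3.0]
difference = standardized_difference(treatment_scores, control_scores)
print(difference, equivalence_category(difference))
```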


TABLE P2. TESTS OF BASELINE EQUIVALENCE BETWEEN CONDITIONS ON ELA AND 
MATH FOR FOCUSED SAMPLE 1 


ELA pretest Math pretest 




FOCUSED SAMPLE 2 

The sample included 340 students (167 treatment, 173 control) who had both grade 3 state ELA and 
math pretests, with 178 students from California and 162 students from Wisconsin. Counts are shown 
in Table P3. 


TABLE P3. FOCUSED SAMPLE 2 


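Count of schools in CA    Count of students with posttest in CA    Count of schools in WI    Count of students with posttest in WI 


Note. MSS stands for the group of students of teachers receiving the Making Sense of SCIENCE program. CA is California. WI is 
Wisconsin. Posttest is the student science achievement assessment. 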


We tested baseline equivalence for (a) ELA pretest and (b) math pretest. Results are in Table P4. We 
observed that baseline equivalence is established for both ELA pretest and math pretest. 


TABLE P4. TESTS OF BASELINE EQUIVALENCE BETWEEN CONDITIONS ON ELA AND 
MATH FOR FOCUSED SAMPLE 2 


ELA pretest Math pretest 




IMPACT BY STATE 


TABLE P5. SAMPLE FOR ANALYSIS OF IMPACT BY STATE 


Count of schools in CA    Count of students with posttest in CA    Count of schools in WI    Count of students with posttest in WI 


Note. MSS stands for the group of students of teachers receiving the Making Sense of SCIENCE program. CA is California. WI is 
Wisconsin. Posttest is the student science achievement assessment. 


We tested baseline equivalence for both ELA pretest and math pretest in both California and 
Wisconsin samples. Results are in Table P6. We observed that baseline equivalence is established for 
both the ELA pretest and the math pretest in both California and Wisconsin samples. 


TABLE P6. TESTS OF BASELINE EQUIVALENCE BETWEEN CONDITIONS ON ELA AND MATH FOR 
SAMPLE USED IN ANALYSIS OF IMPACT BY STATE 


California sample Wisconsin sample 


ELA pretest Math pretest ELA pretest Math pretest 




IMPACT BY GRADE 
TABLE P7. SAMPLE FOR ANALYSIS OF IMPACT BY GRADE 


Grade 4    Grade 5 


Count of schools    Count of students with posttest    Count of schools    Count of students with posttest 


Note. MSS stands for the group of students of teachers receiving the Making Sense of SCIENCE program. 


We tested baseline equivalence for both ELA pretest and math pretest in both Grade 4 and Grade 5 
samples. Results are in Table P8. Baseline equivalence is established for both the ELA pretest and 
math pretest in both Grade 4 and Grade 5 samples. 


TABLE P8. TESTS OF BASELINE EQUIVALENCE BETWEEN CONDITIONS ON ELA AND MATH FOR 
SAMPLE USED IN ANALYSIS OF IMPACT BY GRADE 


Grade 4 sample Grade 5 sample 


ELA pretest Math pretest ELA pretest Math pretest 




Appendix Q. Supplemental Analysis on the Impact on Student Science Achievement 
under High Fidelity of Implementation 


There are several alternatives for evaluating the impact of a program under the condition of high 
fidelity of implementation. We adapted an approach by Unlu et al. (2010). Assessing impact on 
student achievement under high fidelity of implementation (FOI) required following these steps. 


1. Specify a rule for identifying teachers who are above a specific threshold of actual 
implementation (in the treatment group), whom we refer to as “high implementers.” 

2. Apply a model to predict high implementation in the treatment group using a set of teacher 
baseline covariates. 

3. Apply the model developed under 2 to identify a matched sample of control teachers who 
plausibly would have implemented at the same above-threshold levels had they been 
randomly assigned to treatment. 

4. Assess the impact for students of teachers who are either strongly implementing (in treatment) 
or selected as potentially high implementers using model-based results (in control). 
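
The four steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the procedure the study carried out: the teacher records, the 0.8 attendance threshold, the 0.5 probability cutoff, the single baseline covariate, and the use of logistic regression as the prediction model are all hypothetical, and a simple difference in teacher-level means stands in for the impact model.

```python
import math

# Hypothetical teacher records, invented for illustration only.
# Treatment: (baseline_score, attendance_share, mean_student_outcome)
TREATMENT = [
    (-1.5, 0.40, -0.20), (-1.0, 0.55, -0.10), (-0.5, 0.85, 0.05), (0.0, 0.60, 0.00),
    (0.5, 0.90, 0.15), (1.0, 0.95, 0.20), (1.5, 1.00, 0.30), (2.0, 1.00, 0.25),
]
# Control: (baseline_score, mean_student_outcome)
CONTROL = [
    (-1.5, -0.25), (-1.0, -0.15), (-0.5, -0.05), (0.0, 0.00),
    (0.5, 0.05), (1.0, 0.10), (1.5, 0.15), (2.0, 0.20),
]
ATTENDANCE_THRESHOLD = 0.8  # assumed rule for a "high implementer"
PROBABILITY_CUTOFF = 0.5    # assumed cutoff for a predicted high implementer


def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, features):
    """Predicted probability of high implementation; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], features)))


def fit_logistic(rows, labels, steps=2000, learning_rate=0.1):
    """Fit a logistic regression by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            gradient[0] += error
            for j, value in enumerate(features):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


def mean(values):
    return sum(values) / len(values)


# Step 1: flag treatment teachers above the implementation threshold.
labels = [1 if attendance >= ATTENDANCE_THRESHOLD else 0 for _, attendance, _ in TREATMENT]

# Step 2: model high implementation from the baseline covariate (treatment only).
weights = fit_logistic([[score] for score, _, _ in TREATMENT], labels)

# Step 3: select control teachers predicted to be high implementers.
matched_control = [t for t in CONTROL if predict(weights, [t[0]]) >= PROBABILITY_CUTOFF]

# Step 4: compare outcomes for high implementers and their matched controls.
high_implementers = [t for t, label in zip(TREATMENT, labels) if label == 1]
difference = mean([t[2] for t in high_implementers]) - mean([t[1] for t in matched_control])
print(f"high implementers: {len(high_implementers)}, matched controls: {len(matched_control)}")
print(f"difference in mean outcomes: {difference:.3f}")
```

With few teachers or little variation in attendance, the model in step 2 cannot be estimated with any precision, which is the kind of difficulty discussed in the paragraphs that follow.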


While we explored several variants of the method, analysis was fundamentally limited in two ways. 
First, it was difficult to obtain an adequately powered estimate of the relationship between baseline 
(endogenous) characteristics and FOI. FOI was assessed based on attendance in professional learning 
events. There was variability in attendance over the two-year implementation. Attendance was 
determined in large part by assignment of teachers to study-eligible classes. Teachers who joined the 
study late would receive less than full professional learning. While some of the variation on FOI 
could be attributable to teacher-level endogenous factors (including, for example, lower motivation 
leading to late joining), many of the differences in FOI (professional learning attendance) were based 
on mobility resulting from a combination of teacher- and school organizational factors that could not 
be easily captured through surveys of teacher baseline characteristics. Factors affecting joining and 
professional learning (and FOI) levels were likely not sufficiently exogenous to serve as an 
instrument, while also noisy enough that we could not, with precision, relate teacher baseline 
characteristics to FOI outcomes. Essentially, administration-controlled mobility would add a lot of 
noise to the variability in FOI, in a way that would not be predictive of achievement, producing 
a highly underpowered analysis of the effects of dosage. 


To address the first limitation, we limited the sample to teachers in both conditions who remained in 
the study for both years. This eliminates the influence of teacher movement between eligible and 
non-eligible grades/subjects as a source of variance in FOI. This, however, reduced the sample size of 
teachers, and most remaining treatment teachers were fully or close-to-fully compliant with full FOI, 
making it impossible to model the relationship between teacher-endogenous variables and FOI. This 
precludes using the methods above. 




We therefore relied on analysis of impacts on student science achievement using focused samples 1 
and 2 (see Chapter 6). This approach at least held constant the length of time the teacher spent in the 
study, or the exposure of students to teachers in the study in both conditions. 




Appendix R. Detailed Impact Analysis Findings for the Constructed Response Items 


This appendix presents the item-level details of the analysis of the impact on student communication 
of science in writing (Chapter 8), as measured by constructed-response items on the student science 
assessment. 


TABLE R1. SANDSTONE (N = 449 STUDENTS) GRADES 4 AND 5 


Model 1  Model 2  Model 3 a  Model 4  Model 5  Model 6 a 


Fixed effects 
Treatment  -0.014 (0.029)  0.002 (0.030)  0.036 (0.029)  -0.004 (0.031)  0.0003 (0.027)  0.043 (0.030) 
p = .704  p = .952  p = .210  p = .898  p = .991  p = .153 
Pretests  x  x  x  x 
Other Covariates  x  x 
Matched Pairs  x  x  x 
Standardized ES  [illegible]  0.006  0.109  -0.012  0.001  0.130 
Random effects 
Pair  .002 (.004)  .002 (.003)  0 c 
p = .309  p = .203 
School b  .007 (.005)  .002 (.003)  0 c  0 c  0 c  0 c 
p = .076  p = .292 
Student  .076 (.005)  .076 (.005)  .066 (.004)  .092 (.006)  .073 (.005)  .062 (.004) 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 


Note. Performance on this item was rated in terms of five ordinal categories. Quantities in parentheses are standard errors. All effect 
sizes are regression-adjusted impact estimates divided by the control sd for the outcome distribution for the control group. 


a We also evaluated whether impact varies across grades for Models 3 and 6. We observed no differential effect across grades for 
Model 3 (p = .843) or for Model 6 (p = .960). 


b The school is the unit of randomization. 


c Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6, but where we modeled student responses as ordinal, are as follows: 


Model 3: .257 (p = .219), LOR(Cox) = .160 
Model 6: .326 (p = .148), LOR(Cox) = .198 
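
The LOR(Cox) values reported in this appendix are consistent with the Cox index, which rescales a log odds ratio to an approximate standardized mean difference by dividing by 1.65. The sketch below is an illustration under that assumption; the function name is hypothetical. Dividing the reported log odds ratios by 1.65 reproduces most of the LOR(Cox) values to about two decimal places (small discrepancies would be expected from rounding of the inputs), which suggests, but does not confirm, that this is the conversion used.

```python
def cox_index(log_odds_ratio):
    """Rescale a log odds ratio to an approximate standardized mean difference."""
    return log_odds_ratio / 1.65


# Log odds ratios reported for Models 3 and 6 of the Basketball item (Table R2).
for label, log_odds_ratio in (("Model 3", 0.492), ("Model 6", 0.827)):
    print(f"{label}: LOR = {log_odds_ratio:.3f}, Cox index = {cox_index(log_odds_ratio):.3f}")
```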




TABLE R2. BASKETBALL (N = 266 STUDENTS) 5TH GRADE ONLY 


Model 1  Model 2  Model 3  Model 4  Model 5  Model 6 


Fixed effects 
Treatment  0.079 (0.034)  0.088 (0.031)  0.073 (0.035)  0.091 (0.034)  0.088 (0.032)  0.109 (0.042) 
p = .020  p = .004  p = .041  p = .009  p = .006  p = .010 
Pretests  x  x  x  x 
X 
Random effects 
Pair  .002 (.004)  .002 (.002) 
p = .249  p = .227 
School a  .0004 (.004)  0 b  0 b  0 b  0 b  0 b 
p = .453 
Student  .067 (.006)  .056 (.005)  .050 (.004)  .060 (.005)  .050 (.004)  .044 (.004) 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 


Note. Performance on this item was rated in terms of eight ordinal categories (zero and blanks were combined). Quantities in 
parentheses are standard errors. All effect sizes are regression-adjusted impact estimates divided by the control sd for the outcome 
distribution for the control group. 


a The school is the unit of randomization. 


b Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6 (the main models), but where we modeled student responses as ordinal, are as 
follows: 


Model 3: .492 (p = .083), LOR(Cox) = .298 
Model 6: .827 (p = .021), LOR(Cox) = .501 




TABLE R3. BIRDFOOD (N = 427 FOURTH- AND FIFTH-GRADE STUDENTS) 


Model 1  Model 2  Model 3 a  Model 4  Model 5  Model 6 a 


Fixed effects 
Treatment  0.002 (0.023)  0.009 (0.020)  0.008 (0.022)  0.007 (0.023)  0.006 (0.021)  0.012 (0.023) 
p = .920  p = .673  p = .728  p = .754  p = .779  p = .604 
Pretests  x  x  x  x 
Other Covariates  x  x 
Matched Pairs  x  x  x 
Standardized ES  0.009  0.039  0.035  0.030  0.026  0.052 
Random effects 
Pair  .001 (.001)  0 c  0 c 
p = .171 
School b  [illegible]  0 c  0 c  0 c  0 c  0 c 
p = .465 
Student  .052 (.004)  .043 (.003)  .037 (.002)  .049 (.003)  .041 (.003)  .035 (.002) 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 


Note. Performance on this item was rated in terms of five ordinal categories (zero and blanks were combined). Quantities in 
parentheses are standard errors. All effect sizes are regression-adjusted impact estimates divided by the control sd for the outcome 
distribution for the control group. 


a We also evaluated whether impact varies across grades for Models 3 and 6. We observed no differential effect across grades for 
Model 3 (p = .580) or for Model 6 (p = .397). 


b The school is the unit of randomization. 


c Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6, but where we modeled student responses as ordinal, are as follows: 


Model 3: .069 (p = .778), LOR(Cox) = .041 
Model 6: .181 (p = .507), LOR(Cox) = .109 




TABLE R4. CLOUDY DAYS (N = 638 FOURTH- AND FIFTH-GRADE STUDENTS) 


Model 1  Model 2  Model 3 a  Model 4  Model 5  Model 6 a 


Fixed effects 
Treatment  0.015 (0.025)  0.036 (0.023)  0.032 (0.027)  0.024 (0.026)  0.029 (0.025)  0.032 (0.030) 
p = .532  p = .131  p = .235  p = .350  p = .244  p = .280 
Pretests  x  x  x  x 
Other Covariates  x  x 
Matched Pairs  x  x  x 
Standardized ES  0.048  0.116  0.103  0.077  0.094  0.103 
Random effects 
Pair  0 c  0 c  0 c 
School b  0 c  0 c  0 c  0 c  0 c  0 c 
Student  .101 (.005)  .087 (.005)  .081 (.005)  .097 (.005)  .084 (.005)  .078 (.004) 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 


Note. Performance on this item was rated in terms of eight ordinal categories (zero and blanks were combined). Quantities in 
parentheses are standard errors. All effect sizes are regression-adjusted impact estimates divided by the control sd for the outcome 
distribution for the control group. 


a We also evaluated whether impact varies across grades for Models 3 and 6. We observed no differential effect across grades for 
Model 3 (p = .653) or for Model 6 (p = .711). 


b The school is the unit of randomization. 


c Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6, but where we modeled student responses as ordinal, are as follows: 


Model 3: .306 (p = .156), LOR(Cox) = .185 
Model 6: .321 (p = .202), LOR(Cox) = .194 




TABLE R5. PEA SEEDS (N = 260 FIFTH GRADE STUDENTS) 


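Model 1  Model 2  Model 3  Model 4  Model 5  Model 6 


Fixed effects 
Treatment  -0.027 (0.029)  -0.002 (0.026)  -0.004 (0.031)  0.004 (0.029)  0.012 (0.027)  0.004 (0.037) 
p = .347  p = .932  p = .891  p = .886  p = .653  p = .919 
Pretests  x  x  x  x 
Other Covariates  x  x 
Matched Pairs  x  x  x 
Standardized ES  -0.017  0.017  [remaining values illegible] 
Random effects 
Pair  0 b  0 b 
School a  .0007 (.002)  0 b  0 b  0 b  0 b  0 b 
p = .356 
Student  .048 (.005)  .041 (.004)  .036 (.003)  .042 (.004)  .038 (.003)  .032 (.003) 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 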


Note. Performance on this item was rated in terms of four ordinal categories (zero and blanks were combined). Quantities in 
parentheses are standard errors. All effect sizes are regression-adjusted impact estimates divided by the control sd for the outcome 
distribution for the control group. 


a The school is the unit of randomization. 


b Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6, but where we modeled student responses as ordinal, are as follows: 


Model 3: .091 (p = .828), LOR(Cox) = .055 
Model 6: .403 (p = .584), LOR(Cox) = .244 




TABLE R6. BOILING WATER (N = 187 FIFTH GRADE STUDENTS) 


Model 1  Model 2  Model 3  Model 4  Model 5  Model 6 


Fixed effects 
Treatment  -0.030 (0.062)  0.011 (0.057)  0.062 (0.066)  0.012 (0.060)  0.033 (0.056)  0.090 (0.076) 
p = .626  p = .850  p = .354  p = .838  p = .558  p = .234 
Pretests  x  x  x  x 
X 
Random effects 
Pair  .011 (.013)  .009 (.008) 
p = .206  p = .145 
School a  .001 (.013)  0 b  0 b  0 b  0 b  0 b 
p = [illegible] 
Student  .161 (.019)  .139 (.015)  .122 (.012)  .136 (.014)  .118 (.012)  .095 (.010) 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 


Note. Performance on this item was rated in terms of three ordinal categories (zero and blanks were combined). Quantities in 
parentheses are standard errors. All effect sizes are regression-adjusted impact estimates divided by the control sd for the outcome 
distribution for the control group. 


a The school is the unit of randomization. 


b Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6, but where we modeled student responses as ordinal, are as follows: 


Model 3: .337 (p = .421), LOR(Cox) = .204 
Model 6: 1.651 (p = .036), LOR(Cox) = 1.00 




TABLE R7. ICE CUBE (N = 195 FIFTH GRADE STUDENTS) 


Model 1  Model 2  Model 3  Model 4  Model 5  Model 6 


Fixed effects 
Treatment  0.031 (0.078)  -0.005 (0.075)  -0.003 (0.083)  0.073 (0.065)  0.006 (0.062)  -0.023 (0.083) 
p = .692  p = .994  p = .971  p = .262  p = .212  p = .785 


Xx 


ene 


Of ou 


School a  .025 (.013)  .025 (.012)  .015 (.011)  0 b  0 b  0 b 
p = .030  p = .019  p = .102 
Student  [illegible]  .144 (.016)  .129 (.015)  [illegible]  .139 (.014)  [illegible] 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 


Note. Performance on this item was rated in terms of three ordinal categories (zero and blanks were combined). Quantities in 
parentheses are standard errors. All effect sizes are regression-adjusted impact estimates divided by the control sd for the outcome 
distribution for the control group. 


a The school is the unit of randomization. 


b Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6, but where we modeled student responses as ordinal, are as follows: 


Model 3: -.163 (p = .764), LOR(Cox) = -.100 
Model 6: -.267 (p = .740), LOR(Cox) = -.162 




TABLE R8. MINERAL SCRATCH (N = 378 FOURTH- AND FIFTH-GRADE STUDENTS) 


Model 1  Model 2  Model 3 a  Model 4  Model 5  Model 6 a 


Fixed effects 
Treatment  0.008 (0.040)  0.004 (0.037)  -0.024 (0.042)  0.010 (0.040)  0.013 (0.038)  -0.017 (0.045) 
p = .832  p = .910  p = .564  p = .793  p = .730  p = .744 
Pretests  x  x  x  x 
Other Covariates  x  x 
Matched Pairs  x  x  x 
Standardized ES  -0.021  0.011  -0.063  0.026  0.034  -0.045 
Random effects 
Pair  .007 (.004)  .002 (.003)  0 c 
p = .058  p = .257 
School b  0 c  0 c  0 c  0 c  0 c  0 c 
Student  .140 (.011)  .123 (.009)  .110 (.008)  .114 (.008)  .084 (.005)  .103 (.007) 
p < .001  p < .001  p < .001  p < .001  p < .001  p < .001 


Note. Performance on this item was rated in terms of three ordinal categories (zero and blanks were combined). Quantities in 
parentheses are standard errors. All effect sizes are regression-adjusted impact estimates divided by the control sd for the outcome 
distribution for the control group. 


a We also evaluated whether impact varies across grades for Models 3 and 6. We observed no differential effect across grades for 
Model 3 (p = [illegible]) or for Model 6 (p = [illegible]). 


b The school is the unit of randomization. 


c Zero effect estimate with no p value indicates that estimation met boundary condition for quantity. This often indicates that the 
quantity is trivially different from zero. 


The differences between treatment and control in the cumulative log odds of a higher-rated response 
for models like 3 and 6, but where we modeled student responses as ordinal, are as follows: 


Model 3: -.143 (p = .568), LOR(Cox) = -.086 
Model 6: -.096 (p = .730), LOR(Cox) = -.049 




References 
Achieve, Next Gen Science Storylines & STEM Teaching Tools. (2016). Using Phenomena in 
NGSS-Designed Lessons and Units. STEM teaching tools. http://stemteachingtools.org/brief/42 


Ainley, M., & Ainley, J. (2011). Student engagement with science in early adolescence: The 
contribution of enjoyment to students’ continuing interest in learning about science. 
Contemporary Educational Psychology, 36(1), 4-12. 


Bandura, A., Barbaranelli, C., Caprara, G. V., & Pastorelli, C. (2001). Self-efficacy beliefs as shapers of 
children's aspirations and career trajectories. Child Development, 72(1), 187-206. 


Bloom, H. S., Hill, C. J., Black, A. B., and Lipsey, M. W. (2008). Performance Trajectories and 
Performance Gaps as Achievement Effect-Size Benchmarks for Educational Interventions. 
Journal of Research on Educational Effectiveness, 1(4), 289-328. 


Brahier, D. J., & Schaffner, M. (2004). The Effects of a Study-Group Process on the Implementation of 
Reform in Mathematics Education. School Science and Mathematics, 104(4), 170-178. 


Briscoe, C., & Peters, J. (1997). Teacher collaboration across and within schools: Supporting individual 
change in elementary science teaching. Science Education, 81(1), 51-65. 


Bryk, A. S. (2010). Organizing schools for improvement. Phi Delta Kappan, 91(7), 23-30. 


Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical 
IRT modeling [Computer software]. Lincolnwood, IL: Scientific Software International. 


Calvert, L. (2016). The power of teacher agency. The Learning Professional, 37(2), 51. 


Cameron, M., & Lovett, S. (2015). Sustaining the commitment and realising the potential of highly 
promising teachers. Teachers and Teaching, 21(2), 150-163. 


Carlson, J., & Daehler, K. R. (2019). The refined consensus model of pedagogical content knowledge 
in science education. In Repositioning pedagogical content knowledge in teachers’ knowledge for 
teaching science (pp. 77-92). Springer, Singapore. 


Casey, P., Dunlap, K., Brown, K., & Davison, M. (2012). Elementary principals’ role in science 
instruction. Administrative Issues Journal, 2(2), 10. 


Cavagnetto, A. R., Hand, B., & Premo, J. (2020). Supporting student agency in science. Theory Into 
Practice, 59(2), 128-138. 


Cervetti, G. N., Barber, J., Dorph, R., Pearson, P. D., & Goldschmidt, P. G. (2012). The impact of an 
integrated approach to science and literacy in elementary school classrooms. Journal of Research 
in Science Teaching, 49(5), 631-658. 


Darling-Hammond, L., & Richardson, N. (2009). Research review/teacher learning: What matters. 
Educational Leadership, 66(5), 46-53. 




Elliott, S. N., & Bartlett, B. J. (2016). Opportunity to Learn. Oxford Handbooks Online. Oxford 
University Press. 


Evans, B. R. (2011). Content Knowledge, Attitudes, and Self-Efficacy in the Mathematics New York 
City Teaching Fellows (NYCTF) Program. School Science and Mathematics, 111(5), 225-235. 


Frederick, W. C., & Walberg, H. J. (1980). Learning as a function of time. The Journal of Educational 
Research, 73(4), 183-194. https://doi.org/10.1080/00220671.1980.10885233 


Graham, P. (2007). Improving teacher effectiveness through structured collaboration: A case study of 
a professional learning community. RMLE Online, 31(1), 1-17. 


Greenleaf, C. L., Litman, C., Hanson, T. L., Rosen, R., Boscardin, C. K., Herman, J., ... & Jones, B. 
(2011). Integrating literacy and science in biology: Teaching and learning impacts of reading 
apprenticeship professional development. American Educational Research Journal, 48(3), 647-717. 


Hallam, P. R., Smith, H. R., Hite, J. M., Hite, S. J., & Wilcox, B. R. (2015). Trust and collaboration in 
PLC teams: Teacher relationships, principal support, and collaborative benefits. NASSP 
Bulletin, 99(3), 193-216. 


Harrison, C., & Killion, J. (2007). Ten roles for teacher leaders. Educational Leadership, 65(1), 74. 


Hmelo-Silver, C. E., & Barrows, H. S. (2008). Facilitating collaborative knowledge building. Cognition 
and Instruction, 26(1), 48-94. 


Hill, H. C., Rowan, B., & Ball, D. L. (2005). Effects of teachers’ mathematical knowledge for teaching 
on student achievement. American Educational Research Journal, 42(2), 371-406. 


Iveland, A., Tyler, B., Britton, T., Nguyen, K., & Schneider, S. (2017). Administrators Matter in NGSS 
Implementation: How School and District Leaders Are Making Science Happen. WestEd. 


Jenkins, B. (2009). What it takes to be an instructional leader. Principal, 88(3), 34-37. 


Jones, M. G., & Carter, G. (2013). Science teacher attitudes and beliefs. In Handbook of research on 
science education (pp. 1081-1118). Routledge. 


Katzenmeyer, M., & Moller, G. (2009). Awakening the Sleeping Giant: Helping Teachers Develop as 
Leaders. Corwin Press. 


Kanter, D. E., & Konstantopoulos, S. (2010). The impact of a project-based science curriculum on 
minority student achievement, attitudes, and careers: The effects of teacher content and 
pedagogical content knowledge and inquiry-based practices. Science Education, 94(5), 855-887. 


Murphy, C., Neil, P., & Beggs, J. (2007). Primary science teacher confidence revisited: Ten years on. 
Educational Research, 49(4), 415-430. 


McNeill, K. L., Katsh-Singer, R., & Pelletier, P. (2015). Assessing science practices: Moving your class 
along a continuum. Science Scope, 39(4), 21. 




National Research Council. (1996). National Science Education Standards. National Academy Press. 
https://doi.org/10.17226/4962. 


National Research Council. (2012). A Framework for K-12 Science Education: Practices, Crosscutting 
Concepts, and Core Ideas. National Academies Press. 


National Research Council (NRC). (2013). Next Generation Science Standards: For States, by States. 
National Academies Press. 


Pajares, M. F. (1992). Teachers’ beliefs and educational research: Cleaning up a messy construct. 
Review of Educational Research, 62(3), 307-332. 


Penuel, W. R., Harris, C. J., & DeBarger, A. H. (2014). Implementing the Next Generation Science 
Standards: Strategies for Educational Leaders. 


Rubie-Davies, C. (2009). Teacher expectations and labeling. In International handbook of research on 
teachers and teaching (pp. 695-707). Springer. 


Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis. New York: Oxford University Press. 


Supovitz, J. A., & Turner, H. M. (2000). The effects of professional development on science teaching 
practices and classroom culture. Journal of Research in Science Teaching: The Official Journal of the 
National Association for Research in Science Teaching, 37(9), 963-980. 


Tekkumru-Kisa, M., Kisa, Z., & Hiester, H. (2020). Intellectual work required of students in science 
classrooms: Students’ opportunities to learn science. Research in Science Education, 1-15. 


Tschannen-Moran, M., Hoy, A. W., & Hoy, W. K. (1998). Teacher efficacy: Its meaning and measure. 
Review of Educational Research, 68(2), 202-248. 


Thapa, A., Cohen, J., Guffey, S., & Higgins-D’Alessandro, A. (2013). A review of school climate 
research. Review of Educational Research, 83(3), 357-385. 


Tytler, R., & Osborne, J. (2012). Student attitudes and aspirations towards science. In Second 
international handbook of science education (pp. 597-625). Springer, Dordrecht. 


Wong, N., Heller, J. L., Kaskowitz, S. R., Burns, S., & Limbach, J. O. (2020). Final Report of the Making 
Sense of Science and Literacy Implementation and Scale-up Studies. [U.S. Department of Education 
Project No. U411B140026]. Heller Research Associates. 


Wright, K. L., Franks, A. D., Kuo, L. J., McTigue, E. M., & Serrano, J. (2016). Both theory and practice: 
Science literacy instruction and theories of reading. International Journal of Science and 
Mathematics Education, 14(7), 1275-1292. 


Unlu, F., Bozzi, L., Layzer, C., Smith, A., Price, C., & Hurtig, R. (2010, October). Linking 
implementation fidelity to impacts in an RCT: A matching approach. In symposium: Using 
matching methods to analyze RCT impacts on program-related subgroups, Association for Public Policy 
Analysis and Management (Vol. 13). 




Urick, A., Wilson, A. S., Ford, T. G., Frick, W. C., & Wronowski, M. L. (2018). Testing a framework of 
math progress indicators for ESSA: How opportunity to learn and instructional leadership 
matter. Educational Administration Quarterly, 54(3), 396-438. 




